Instead of having `Parser_getObj` fail unconditionally for the referenced PDF file, this patch attempts to let searching for the main trailer continue even if there are errors.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1250079.
This is similar to the existing `isCmd` and `isDict` functions, which already support similar kind of checks.
With the updated `isName` function, we'll be able to simplify many callsites from: `isName(someVariable) && someVariable.name === 'someName'` to: `isName(someVariable, 'someName')`.
This patch improves the performance of issue 5808, but I'm not sure if it's enough to call it fixed. On average, this patch reduces the number of textLayer div's by a factor of 3, and it also reduces the time spend in `getTextContent` by a factor of ~2.
The PDF file is generated by `Scribus PDF`, which for reasons I cannot understand is placing redundant `Tf` commands before *every* showText command.
Note how the PDF file also contains lots of (basically) identical fonts, but with slightly different names, which causes unnecessary font-switching. This causes some unnecessary breaking of textLayer div's, but this issue cannot be easily worked around.
[api-minor] Add a parameter to `PDFPageProxy_getTextContent` that controls whether `PartialEvaluator_getTextContent` will attempt to combine same line text items
Note that I used a separate warning message for this case, instead of utilizing the same one as in the unsupported subtype case, to more clearly indicate that the PDF file itself is to blame rather than PDF.js.
Fixes 7446.
Fonts that are not referenced by `Ref`s are very uncommon in practice, but it can unfortunately happen. In this case, we're currently not caching them in the usual way, i.e. by `Ref`, which leads to failures when a page is rendered after `cleanup` has run.
The simplest solution would have been to remove the `font.translated` workaround, but since this would have meant loading these kind of fonts over and over, the patch attempts to be a bit clever about this situation.
Note that if we instead loaded fonts per *page*, instead of per document, this issue wouldn't have existed.
Originally, I was just going to change this code to use `Ref_toString` in a couple more places. When I started reading the code, I figured that it wouldn't hurt to clean up a couple of comments. While doing this, I noticed that the logic for the (rare) `isDict(fontRef)` case could do with a few improvements.
There should be no functional changes with this patch, but given the added reference checks, we will now avoid bogus `Ref`s when resolving font aliases. In practice, as issue 7403 shows, the current code can break certain PDF files even if it's very rare.
Note that the only thing that this patch will change, is the `font.loadedName` in the case where a `fontRef` is a reference *and* the font doesn't have a descriptor. Previously for `fontRef = Ref(4, 0)` we'd get `font.loadedName = 'g_d0_f4_0'`, and with this patch `font.loadedName = g_d0_f4R`, which is actually one character shorted in most cases. (Given that `Ref_toString` contains an optimization for the `gen === 0` case, which is by far the most common `gen` value.)
In the already existing fallback case, where the `fontName` is used to when creating the `font.loadedName`, we allow any alphanumeric character. Hence I don't see how (as mentioned above) e.g. `font.loadedName = g_d0_f4R` would be an issue here.
From the discussion in issue 7445, it seems that there may be cases where an API consumer would want to get the text content as is, without combined text items.
After PR 7039, the PDF file in issue 7492 no longer renders at all, but note that text selection wasn't working correctly previously.
The problem with the PDF file in issue 7492 is that the `cMap`, in the `toUnicode` entry in the font, contains an invalid name:
```
/CMapName /-usr-share-fonts-truetype-Panton-Panton Family-Fontfabric - Panton.otf,000-UTF16 def
```
When we parse that line, things obviously break because there are spaces present in the wrong places.
To avoid that issue, the patch simply lets `parseCMap` continue when errors are encountered, to try and recover usable data. Note that by not aborting immediatly when an error is encountered, we are also able to fix the text selection.
Obviously, it could be argued that we should just immediatly reject a corrupt `cMap`. But given that they usually are correct, it seems that trying to recover as much data as possible from corrupt one can only be a good thing for both glyph mapping and text selection.
Fixes 7492.
In the PDF file in the issue, some of the glyphs end up being mapped to the Lepcha Unicode block; see https://en.wikipedia.org/wiki/Lepcha_(Unicode_block).
This didn't use to matter, but after HarfBuzz updates that improved support for Lepcha fonts, in particular https://bugzilla.mozilla.org/show_bug.cgi?id=1249861, some glyphs are now moved horizontally.
To avoid that, this patch adds the Lepcha block to the list of Unicode ranges that we skip when building the glyph mapping.
Fixes 7426.
After PR 7441, where `recoverGlyphName` is used a lot more than before, many PDF files will generate a lot of warnings the console. For normal usage, compared to debugging/development, this is probably more annoying than helpful.
Fallback to attempt to recover standard glyph names when amending the `charCodeToGlyphId` with entries from the `differences` array in `type1FontGlyphMapping` (issue 7439)
In fonts with only upper-case glyphs, that are also missing a space glyph, `get spaceWidth` won't be able to return anything useful.
By adding upper-case `I` as a fallback, we can thus improve text-selection in some PDF files.
Note that locally, the patch causes slight movement in a few existing `text` tests, but in my opinion this actually looks like slight improvements.
Fixes 7180.
Currently the `isSpace` utility function is a member of `Lexer`, which seems suboptimal, given that it's placed in `core/parser.js`. In practice, this means that in a number of `core/*.js` files we thus have an *otherwise* completely unnecessary dependency on `core/parser.js` for a one-line function.
Instead, this patch moves `isSpace` into `shared/util.js` which seems more appropriate for this kind of utility function. Not to mention that since all the affected `core/*.js` files already depends on `shared/util.js`, this doesn't incur any more file dependencies.
Currently `setGState` is completely broken, and looking through the history of that code, it seems to me that this may never have worked correctly.
This patch fixes the text-selection in `extgstate.pdf` in the test-suite, which is also added as a `text` test.
Fixes http://www.pdf-archive.com/2013/09/30/file2/file2.pdf.
Note how it's not possible to show the various Popup Annotations in the above document.
To fix that, this patch lets the Popup inherit the flags of the parent, in the special case where the parent is `viewable` *and* the Popup is not.
In general, I don't think that a Popup must have the same flags set as the parent. However, it seems very strange to have a `viewable` parent annotation, and then not being able to view the Popup.
Annoyingly the PDF specification doesn't, as far as I can find, mention anything about how this case should be handled, but this patch seem consistent with the actual behaviour in Adobe Reader.
Currently for explicit destinations, compared to named destinations, we manually try to build a hash that often times is a quite poor representation of the *actual* destination. (Currently this only, kind of, works for `\XYZ` destinations.)
For PDF files using explicit destinations, this can make it difficult/impossible to obtain a link to a specific section of the document through the URL.
Note that in practice most PDF files, especially newer ones, use named destinations and these are thus unnaffected by this patch.
This patch also fixes an existing issue in `PDFLinkService_getDestinationHash`, where a named destination consisting of only a number would not be handled correctly.
With the added, and already existing, type checks in place for destinations, I really don't think that this patch exposes any "sensitive" internal destination code not already accessible through normal hash parameters.
*Please note:* Just trying to improve the algorithm that generates the hash is unfortunately not possible in general, since there are a number of cases where it will simply never work well.
- First of all, note that `getDestinationHash` currently relies on the `_pagesRefCache`, hence it's possible that the hash returned is empty during e.g. ranged/streamed loading of a PDF file.
- Second of all, the currently computed hash is actually dependent on the document rotation. With named destinations, the fetched internal destination array is rotational invariant (as it should be), but this will not hold in general for the hash. We can easily avoid this issue by using a stringified destination array.
- Third of all, note that according to the PDF specification[1], `GoToR` destinations may actually contain explicit destination arrays. Since we cannot really construct a hash in `annotation.js`, we currently have no good way to support those. Even though this case seems *very* rare in practice (I've not actually seen such a PDF file), it's in the specification, and this patch allows us to support that for "free".
---
[1] http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G11.1951685
Currently the `getPageIndex` method will happily return `0`, even if the `Ref` parameter doesn't actually point to a proper /Page dictionary.
Having the API trust that the consumer is doing the right thing seems error-prone, hence this patch which adds a check for this case.
Given that the `Catalog_getPageIndex` method isn't used in any hot part of the codebase, this extra check shouldn't be a problem.
(Note: in the standard viewer, it is only ever used from `PDFLinkService_navigateTo` if a destination needs to be resolved during document loading, which isn't common enough to be an issue IMHO.)
These have been found using `gulp lint` in combination with the `unused:
true` parameter for JSHint. Unfortunately there are too many false
positives to enable this feature, but now that most globals have been
removed because of the conversion to UMD the results are much more
useful than before.
In the font in question, there are a couple of `topDict` entries that have invalid values (`0xF 0xF`, i.e. just eof markers without any actual numbers).
This causes the `parseFloatOperand` function, inside `CFFParser_parseDict`, to return `NaN`. Currently we pass this broken font onto the browser, which OTS unsurprisingly rejects.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1068432.
As evident from e.g. PRs 6485 and 7118, some bad PDF generators unfortunately create Arrays where *some* elements are indirect objects (i.e. `Ref`s). This seems to mostly affect Arrays that contain numbers, such as e.g. `Matrix/FontMatrix/BBox/FontBBox/Rect/Color/...`, and has manifested itself in PDF files that fail to render correctly (some elements are missing).
The problem in both the cases above, besides broken rendering, was that there were *no* errors/warnings that indicated what the problem was, making it difficult to pinpoint the issue.
Hence this patch, where I've audited all usages of `Dict_get` in `src/core/` files, and replaced it with `Dict_getArray` where appropriate to try and prevent unnecessary future bugs.
This naming issue has been present since PR 3529, but at least I cannot find any issues/bugs that seem to have been caused by it, which is good.
The patch also removes an unnecessary `else` branch, since an already existing `break` means that it's redundant.
Fixes 7232.
- Add support for the 'NewWindow' property.
- Ensure that destinations are applied to the *remote* document, instead of the current one.
- Handle the `F` entry being a standard string, instead of a dictionary.
Using `new {Name,Cmd}` should be avoided, since it creates a new object on *every* call, whereas `{Name,Cmd}.get` uses caches to only create *one* object regardless of how many times they are called.
Most of these are found in the unit-tests, where increased memory usage probably doesn't matter very much. But it still seems good to get rid of those cases, since no part of the codebase ought to advertise that usage.
Given the small size of the patch, I'm also tweaking a few comments and class names.
Some browsers render certain special characters with width 0, others with strictly positive width. (For example, the Greek Delta, Δ, has width 0 in Google Chrome, and a positive width in Firefox.) The `if` clause in operation so far results in different text layer DOM trees for different browsers.
This commit fixes that by adding the elements independently of their width.
Note that in the PDF files provided by the reporter, this issue was limited to `Rect` arrays in AcroForm entries (which we currently don't support).
However, since a bad PDF generator could create this problem in *any* kind of annotation, the reduced test-case included here uses a simple LinkAnnotation instead.
Fixes 7115.
According to "The table directory" under https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6.html#Directory, TrueType font tables should have `uint32` checksums.
This is something that I noticed, and was initially confused about, while debugging a TrueType issue.
As far as I can tell, the current (`int32`) checksums we use doesn't cause any issues in practice. However, I do think that this should be addressed to agree with the specification, and to reduce possible confusion when reading the font code.
Currently there's a lot of duplicate code for non-embedded `toFontChar`, which this patch simplifies by extracting the code into a helper function instead.
This patch adds a `getUnicodeForGlyph` helper function, which is used to recover Unicode values for non-standard glyph names.
Some PDF generators, e.g. Scribus PDF, use improper `uniXXXX` glyph names which breaks the glyph mapping. We can avoid this by converting them to "standard" glyph names instead.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1132849.
Fixes 6893.
Fixes 6894.
In the PDF file in question, some of the 'name' table entries have `record.length === 0`. This becomes problematic in the non-unicode case, since `font.getBytes(0)` will fetch the *entire* stream.
Given that OTS rejects 'name' entries larger than `2^16`, this thus explain the sanitizer errors.
Fixes 7020.
This is required to be able to use it in the annotation display code,
where we now apply it to sanitize the filename of the FileAttachment
annotation. The PDF file from https://bugzilla.mozilla.org/show_bug.cgi?id=1230933 has shown that some PDF generators include the path of the file rather than the filename, which causes filenames with weird initial characters. PDF viewers handle this differently (for example Foxit Reader just replaces forward slashes with spaces), but we think it's better to only show the filename as intended.
Additionally we add unit tests for the `getFilenameFromUrl` helper
function.
*A more robust solution for issue 6066.*
As a temporary work-around for (the upstream) [bug 1164199](https://bugzilla.mozilla.org/show_bug.cgi?id=1164199), we parsed *all* images in the Firefox addon during a short time.
Doing so uncovered an issue with our image handling (see 6066), for JPEG images with a `DeviceGray` ColorSpace *and* `bpc !== 1` (bits per component).
As long as we let the browser handle image decoding in this case, this isn't going to be an issue, but I do think that we should proactively fix this to avoid future issues if we change where the images are decoded (in `jpg.js` vs in browser).
Also, we currently don't seem to have a test-case for that kind of image data.
Currently the `C` entry in an outline item is returned as is, which is neither particularly useful nor what the API documentation claims.
This patch also adds unit-tests for both the color handling, and the `F` entry (bold/italic flags).
`Dict_getAll` is problematic for a number of reasons. First of all, as issue 6961 shows, it can be really bad for performance, since it dereferences all indirect objects.
Second of all, all the derefencing can lead to data being unncessarily requested when ranged/chunked loading is used, thus unnecessarily delaying rendering.
Note: For cases where `Dict_getAll` was previously used, `Dict_getKeys` in combination with `Dict_get` can be used instead. This has the advantage that data isn't requested until it's actually needed.
For the operators that we currently support, the arguments are not `Dict`s, which means that it's not really necessary to use `Dict_getAll` in `EvaluatorPreprocessor_read`.
Also, I do think that if/when we support operators that use `Dict`s as arguments, that should be dealt with in the corresponding `case` in `PartialEvaluator_getOperatorList` which handles the operator.
The only reason that I can find for using `Dict_getAll` like that, is that prior to PR 6550 we would just append certain (currently unsupported) operators without doing any further processing/checking. But as issue 6549 showed, that can lead to issues in practice, which is why it was changed.
In an effort to prevent possible issue with unsupported operators, this patch simply ignores operators with `Dict` arguments in `PartialEvaluator_getOperatorList`.
For the `CalGray`/`CalRGB`/`Lab` colour spaces, we're currently using `getAll` to retrieve the parameters. However that's not really necessary, since we may just as well explicitly `get` the needed parameters instead.
Some bad PDF generators, in particular "Scribus PDF", duplicates resources *a lot* at various levels of the PDF files. This can lead to `PartialEvaluator_hasBlendModes` taking an unreasonable amount of time to complete.
The reason is that the current code is using `Dict_getAll`, which recursively dereferences *all* indirect objects, which can be really slow. This patch instead uses `Dict_getKeys`, and then manually looks up only the necessary indirect objects.
I've added the PDF file as a `load` test. The most important thing here is probably to ensure that the file remains available in the repo, and the comment should help reduced the chance of regressions. (Note that locally, the `load` test times out without this patch, but we cannot really assume that that always happens.)
Fixes 6961.
Reverts "Hack to avoid intermidiate Chrome failures during tests."
(2b2c521213).
require.js uses importScript asynchronously, which activates the worker
GC bug in WebKit. This patch works around a bug in a way that is similar
in the upcoming (but not yet released) require.js 2.1.23
The advantage of the new work-around is that it allows the runtime to
garbage-collect idle Workers.
References:
- https://crbug.com/572225
- https://webkit.org/b/153317
*This patch is based on something I noticed while debugging some of the PDF files in issue 6931.*
In a number of the cases in `setGState`, we're implicitly assuming that we're not dealing with indirect objects (i.e. `Ref`s). See e.g. the 'Font' case, or the various cases where we simply do `gStateObj.push([key, value]);` (since the code in `canvas.js` won't be able to deal with a `Ref` for those cases).
The reason that I didn't use `Dict_forEach` instead, is that it would re-introduce the unncessary closures that PR 5205 removed.
The intention of PR 5192 was to avoid adding empty `setGState` ops to the operatorList. But the patch accidentally used `>=`, which means that it's not actually working as intended, since empty arrays always have `length === 0`.
Even though the currently known test-cases render correctly without this patch, that seems more like a lucky coincidence, given that there's no guarantee that `transferMap[255] === 0` for every possible transfer function.
This patch fixes an issue that I inadvertently introduced in PR 5815, where we accidentally modify the `Differences` array in the encoding dictionary for indirect objects.
Instead of this change, we could also have used the now existing `Dict_getArray`. However in this case I don't think that would have been a good idea, since it would mean iterating through the array *twice*.