For `PartialEvaluator_getTextContent`, the same `args` Array should be re-used for every `EvaluatorPreprocessor_read` call. Hence we want to ensure that it's not accidentally replaced with `null` in `EvaluatorPreprocessor_read`, since otherwise corrupt PDF files (with too few arguments for certain commands) will cause errors in `PartialEvaluator_getTextContent`.
Perhaps a micro-optimization, but this patch also changes two `!args` comparisons to `args === null`, since that should be a tiny bit more efficient.
While the array argument to TJ should only contain strings and numbers, other
unfortunate items are found in PDFs in the wild, e.g.:
[(Grandes) 0.0 Tc
-250.0 (Client\350les,) 0.0 Tc
-250.0 (Financements) 0.0 Tc
-250.0 (et) 0.0 Tc
-250.0 (March\351s) ] TJ
getOperatorList already properly ignores any non-string, non-numeric values in
TJ arrays; without this patch to getTextContent, returned text items can have
NaN widths due to calculations being applied to those non-numeric values.
Even though this patch passes all tests (unit/font/reference) locally, including the new ones that I added in PR 7621, I'm still a bit nervous about modifying the code that choose the fallback encoding for fonts without an `/Encoding` entry.
Note that over the years this code has been changed on a number of occasions, see a possibly incomplete [list here], to deal with various cases of incorrect font data.
According to the PDF specification, see http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1904184, it seems that we should fallback to the `StandardEncoding` for Nonsymbolic fonts.
There's obviously a risk that fixing this particular issue *could* break other PDF files for which we don't have tests. However I've tried to change the logic as little as possible in this patch, to hopefully reduce possible breakage.
Based on debugging numerous font issue, it seems that a lot of fonts actually set the Symbolic flag, even when they are in fact *not* Symbolic. Fonts actually marked as Nonsymbolic seem to be somewhat less common, which I hope should reduce the risk of the patch somewhat.
Fixes 7580.
Note that in order to prevent any possible issues, this patch does *not* try to amend the `toUnicode` data for Type1 fonts that contain either `ToUnicode` or `Encoding` entries in the font dictionary.
Fixes, or at least improves, issues/bugs such as e.g. 6658, 6901, 7182, 7217, bug 917796, bug 1242142.
This patch improves the performance of issue 5808, but I'm not sure if it's enough to call it fixed. On average, this patch reduces the number of textLayer div's by a factor of 3, and it also reduces the time spend in `getTextContent` by a factor of ~2.
The PDF file is generated by `Scribus PDF`, which for reasons I cannot understand is placing redundant `Tf` commands before *every* showText command.
Note how the PDF file also contains lots of (basically) identical fonts, but with slightly different names, which causes unnecessary font-switching. This causes some unnecessary breaking of textLayer div's, but this issue cannot be easily worked around.
[api-minor] Add a parameter to `PDFPageProxy_getTextContent` that controls whether `PartialEvaluator_getTextContent` will attempt to combine same line text items
Fonts that are not referenced by `Ref`s are very uncommon in practice, but it can unfortunately happen. In this case, we're currently not caching them in the usual way, i.e. by `Ref`, which leads to failures when a page is rendered after `cleanup` has run.
The simplest solution would have been to remove the `font.translated` workaround, but since this would have meant loading these kind of fonts over and over, the patch attempts to be a bit clever about this situation.
Note that if we instead loaded fonts per *page*, instead of per document, this issue wouldn't have existed.
Originally, I was just going to change this code to use `Ref_toString` in a couple more places. When I started reading the code, I figured that it wouldn't hurt to clean up a couple of comments. While doing this, I noticed that the logic for the (rare) `isDict(fontRef)` case could do with a few improvements.
There should be no functional changes with this patch, but given the added reference checks, we will now avoid bogus `Ref`s when resolving font aliases. In practice, as issue 7403 shows, the current code can break certain PDF files even if it's very rare.
Note that the only thing that this patch will change, is the `font.loadedName` in the case where a `fontRef` is a reference *and* the font doesn't have a descriptor. Previously for `fontRef = Ref(4, 0)` we'd get `font.loadedName = 'g_d0_f4_0'`, and with this patch `font.loadedName = g_d0_f4R`, which is actually one character shorted in most cases. (Given that `Ref_toString` contains an optimization for the `gen === 0` case, which is by far the most common `gen` value.)
In the already existing fallback case, where the `fontName` is used to when creating the `font.loadedName`, we allow any alphanumeric character. Hence I don't see how (as mentioned above) e.g. `font.loadedName = g_d0_f4R` would be an issue here.
From the discussion in issue 7445, it seems that there may be cases where an API consumer would want to get the text content as is, without combined text items.
Currently `setGState` is completely broken, and looking through the history of that code, it seems to me that this may never have worked correctly.
This patch fixes the text-selection in `extgstate.pdf` in the test-suite, which is also added as a `text` test.
As evident from e.g. PRs 6485 and 7118, some bad PDF generators unfortunately create Arrays where *some* elements are indirect objects (i.e. `Ref`s). This seems to mostly affect Arrays that contain numbers, such as e.g. `Matrix/FontMatrix/BBox/FontBBox/Rect/Color/...`, and has manifested itself in PDF files that fail to render correctly (some elements are missing).
The problem in both the cases above, besides broken rendering, was that there were *no* errors/warnings that indicated what the problem was, making it difficult to pinpoint the issue.
Hence this patch, where I've audited all usages of `Dict_get` in `src/core/` files, and replaced it with `Dict_getArray` where appropriate to try and prevent unnecessary future bugs.
Using `new {Name,Cmd}` should be avoided, since it creates a new object on *every* call, whereas `{Name,Cmd}.get` uses caches to only create *one* object regardless of how many times they are called.
Most of these are found in the unit-tests, where increased memory usage probably doesn't matter very much. But it still seems good to get rid of those cases, since no part of the codebase ought to advertise that usage.
Given the small size of the patch, I'm also tweaking a few comments and class names.
For the operators that we currently support, the arguments are not `Dict`s, which means that it's not really necessary to use `Dict_getAll` in `EvaluatorPreprocessor_read`.
Also, I do think that if/when we support operators that use `Dict`s as arguments, that should be dealt with in the corresponding `case` in `PartialEvaluator_getOperatorList` which handles the operator.
The only reason that I can find for using `Dict_getAll` like that, is that prior to PR 6550 we would just append certain (currently unsupported) operators without doing any further processing/checking. But as issue 6549 showed, that can lead to issues in practice, which is why it was changed.
In an effort to prevent possible issue with unsupported operators, this patch simply ignores operators with `Dict` arguments in `PartialEvaluator_getOperatorList`.
Some bad PDF generators, in particular "Scribus PDF", duplicates resources *a lot* at various levels of the PDF files. This can lead to `PartialEvaluator_hasBlendModes` taking an unreasonable amount of time to complete.
The reason is that the current code is using `Dict_getAll`, which recursively dereferences *all* indirect objects, which can be really slow. This patch instead uses `Dict_getKeys`, and then manually looks up only the necessary indirect objects.
I've added the PDF file as a `load` test. The most important thing here is probably to ensure that the file remains available in the repo, and the comment should help reduced the chance of regressions. (Note that locally, the `load` test times out without this patch, but we cannot really assume that that always happens.)
Fixes 6961.
*This patch is based on something I noticed while debugging some of the PDF files in issue 6931.*
In a number of the cases in `setGState`, we're implicitly assuming that we're not dealing with indirect objects (i.e. `Ref`s). See e.g. the 'Font' case, or the various cases where we simply do `gStateObj.push([key, value]);` (since the code in `canvas.js` won't be able to deal with a `Ref` for those cases).
The reason that I didn't use `Dict_forEach` instead, is that it would re-introduce the unncessary closures that PR 5205 removed.
The intention of PR 5192 was to avoid adding empty `setGState` ops to the operatorList. But the patch accidentally used `>=`, which means that it's not actually working as intended, since empty arrays always have `length === 0`.
Even though the currently known test-cases render correctly without this patch, that seems more like a lucky coincidence, given that there's no guarantee that `transferMap[255] === 0` for every possible transfer function.
This patch fixes an issue that I inadvertently introduced in PR 5815, where we accidentally modify the `Differences` array in the encoding dictionary for indirect objects.
Instead of this change, we could also have used the now existing `Dict_getArray`. However in this case I don't think that would have been a good idea, since it would mean iterating through the array *twice*.
This patch goes a bit further than issue 6612 requires, and replaces all kinds of whitespace with standard spaces.
When testing this locally, it actually seemed to slightly improve two existing test-cases (`tracemonkey-text` and `taro-text`).
Fixes 6612.