Sakurai/pdf.js - pdf.js - Gitea on kemo

Sakurai/pdf.js

Author	SHA1	Message	Date
Calixte Denizet	18e79e3c0b	[text selection] Add the whitespaces present in the pdf in the text chunk - it aims to fix issue #14627; - the basic idea of the recent text refactoring was to only consider the rendered visible whitespaces. But sometimes, the heuristics aren't correct and although some whitespaces are in the text stream they weren't in the text chunks because they were too small. Hence we added some exceptions, for example, we always add a whitespace when it is between two non-whitespace chars but only when in the same Tj. So basically, this patch removes the constraint to have the chars in the same Tj (in using a circular buffer to save the two last chars) but don't add a space when the visible space is really too small (hence `NOT_A_SPACE_FACTOR`).	2022-03-27 14:34:56 +02:00
Jonas Jenwald	73d2ddac0d	Update npm packages Note that the Prettier update made it possible to move a couple of comments after `default:`-cases back to their original/intended positions, please see https://prettier.io/blog/2022/03/16/2.6.0.html	2022-03-20 10:59:13 +01:00
Jonas Jenwald	c0736647f9	Add general iteration support in the `RefSet` and `RefSetCache` classes This patch removes the existing `forEach` methods, in favor of making the classes properly iterable instead. Given that the classes are using a `Set` respectively a `Map` internally, implementing this is very easy/efficient and allows us to simplify some existing code.	2022-03-18 14:27:34 +01:00
Jonas Jenwald	d0d5c596fb	When `stopAtErrors` is set, throw rather than warn when exceeding `maxImageSize` (issue 14626) The situation described in issue 14626 seems like a fairly special case, and it thus seem reasonable that we simply follow the same pattern as elsewhere in the `PartialEvaluator` when the `stopAtErrors` API-option is being used.	2022-03-03 13:11:29 +01:00
Jonas Jenwald	99cd24ce3e	Remove the `isString` helper function The call-sites are replaced by direct `typeof`-checks instead, which removes unnecessary function calls. Note that in the `src/`-folder we already had more `typeof`-cases than `isString`-calls.	2022-02-26 16:33:41 +01:00
Jonas Jenwald	05edd91bdb	Remove the `isNum` helper function The call-sites are replaced by direct `typeof`-checks instead, which removes unnecessary function calls. Note that in the `src/`-folder we already had more `typeof`-cases than `isNum`-calls. These changes were mostly done using regular expression search-and-replace, with two exceptions: - In `Font._charToGlyph` we no longer unconditionally update the `width`, since that seems completely unnecessary. - In `PDFDocument.documentInfo`, when parsing custom entries, we now do the `typeof`-check once.	2022-02-22 11:55:34 +01:00
Jonas Jenwald	b282814e38	Prefer `instanceof Name` rather than calling `isName()` with one argument Unless you actually need to check that something is both a `Name` and also of the correct type, using `instanceof Name` directly should be a tiny bit more efficient since it avoids one function call and an unnecessary `undefined` check. This patch uses ESLint to enforce this, since we obviously still want to keep the `isName` helper function for where it makes sense.	2022-02-21 12:45:00 +01:00
Jonas Jenwald	4df82ad31e	Prefer `instanceof Dict` rather than calling `isDict()` with one argument Unless you actually need to check that something is both a `Dict` and also of the correct type, using `instanceof Dict` directly should be a tiny bit more efficient since it avoids one function call and an unnecessary `undefined` check. This patch uses ESLint to enforce this, since we obviously still want to keep the `isDict` helper function for where it makes sense.	2022-02-21 12:44:56 +01:00
Jonas Jenwald	2cb2f633ac	Remove the `isRef` helper function This helper function is not really needed, since it's just a wrapper around a simple `instanceof` check, and it only adds unnecessary indirection in the code.	2022-02-19 15:33:42 +01:00
Jonas Jenwald	1a31855977	Remove the `isStream` helper function At this point all the various Stream-classes extends an abstract base-class, hence this helper function is no longer necessary and only adds unnecessary indirection in the code.	2022-02-17 13:51:36 +01:00
Calixte Denizet	18e3a98c2b	[api-minor] Don't add in the text content the chars which are out-of-page (bug 1755201) - it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1755201; - if the glyph position is not within the view then skip it.	2022-02-13 21:07:11 +01:00
Jonas Jenwald	64f3dbeb48	Let `Lexer.getNumber` treat a single minus sign as zero (bug 1753983) This appears to be consistent with the behaviour in both Adobe Reader and PDFium (in Google Chrome); this is essentially the same approach as used for a single decimal point in PR 9827.	2022-02-07 17:09:47 +01:00
Jonas Jenwald	403baa7bba	[api-minor] Remove the `normalizeWhitespace` option in the `PDFPageProxy.{getTextContent, streamTextContent}` methods (issue 14519, PR 14428 follow-up) With these changes, we'll now always replace all whitespaces with standard spaces (0x20). This behaviour is already, since many years, the default in both the viewer and the browser-tests.	2022-02-03 09:17:22 +01:00
Calixte Denizet	3a7004ca25	Take into account all rotations before comparing glyph positions - it aims to fix #14497; - previously, only rotations with an angle 0, 90, 180 or 270 were taken into account; - so generalize to any angle but keep the fast path for 0, 90, ... because they're likely more common than anything else.	2022-01-26 17:19:00 +01:00
Calixte Denizet	e1d3a3b414	Remove the invisible format marks from the text chunks - it aims to fix issue #9186.	2022-01-24 13:47:24 +01:00
Jonas Jenwald	ba37d600d7	Make the `normalizeWhitespace` handling, in the `PartialEvaluator`, more efficient (PR 14428 follow-up) After the changes in PR 14428 we can directly, and more efficiently, handle whitespace conversion in `PartialEvaluator.getTextContent` when the `normalizeWhitespace` option is being used. This way we no longer need a separate helper function for this, and can avoid having to (again) iterate through the text and checking each character. Finally, this also removes the need for using a regular expression on e.g. all non-ASCII text.	2022-01-16 08:29:21 +01:00
calixteman	da953f4b64	Merge pull request #14428 from calixteman/typo Use the correct dimension to know if we have to add an EOL in vertical mode	2022-01-15 12:47:10 -08:00
Calixte Denizet	9dae421a0d	Handle all the whitespaces the same way when creating text chunks	2022-01-15 21:44:00 +01:00
Jonas Jenwald	53d4ee7990	Prevent circular references in Type3 fonts In corrupt PDF documents Type3 fonts may introduce circular dependencies, thus resulting in the affected font(s) never loading and parsing/rendering never completing. Note that I've not seen any real-world examples of this kind of font corruption, but the attached PDF document was rather found in https://github.com/pdf-association/safedocs/tree/main/Miscellaneous%20Targeted%20Test%20PDFs Please note: That repository contains a number of reduced test-cases that are specifically intended to test interoperability (between PDF viewer) and parsing/rendering for various kinds of strange/corrupt PDF documents. Some of the test-cases found there may thus not make sense to try and "fix" upfront, in my opinion, unless the problems are also found in real-world PDF documents.	2022-01-13 17:58:37 +01:00
Calixte Denizet	9bb636402a	Use the correct dimension to know if we have to add an EOL in vertical mode	2022-01-07 15:19:03 +01:00
Calixte Denizet	6cdae5ac4d	Use positive dimensions for text chunks in the text layer (issue #14415 ).	2022-01-05 10:49:56 +01:00
Jonas Jenwald	6da0944fc7	[api-minor] Replace `PDFDocumentProxy.getStats` with a synchronous `PDFDocumentProxy.stats` getter Please note: These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents. The existing `PDFDocumentProxy.getStats` function, which in the default viewer is called for each rendered page, requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for every rendered page. This patch proposes replacing that method with a synchronous `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class to track Stream/Font-types and only send them to the main-thread the first time that a type is encountered. Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1] This re-factoring will obviously benefit longer document the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes. Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return the same identical object. This is something that we can easily take advantage of in the default viewer, by now only reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents). --- [1] Furthermore, the maximium number of `StreamType`/`FontType` are `10` respectively `12`, which means that regardless of the complexity and page count in a PDF document there'll never be more than twenty-two "DocStats" messages sent; see `41ac3f0c07/src/shared/util.js (L206-L232)` [2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only result in a total of seven "DocStats" messages being sent from the worker-thread. [3] Reporting telemetry, in Firefox, includes using `JSON.stringify` on the data and then sending an event to the `PdfStreamConverter.jsm`-code. In that code the event is handled and `JSON.parse` is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549	2021-11-20 12:20:55 +01:00
Calixte Denizet	a88ff34eb7	Don't consider space as real space when there is an extra spacing (bug 931481) - it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=931481; - real space chars are pushed in the chunk but when there is an extra spacing, the next char position must be compared with the previous one; - for example, an extra spacing can cancel a space so visually there are no space.	2021-11-12 18:53:48 +01:00
Jonas Jenwald	ea1c348c67	Always prefer abbreviated keys, over full ones, when doing any dictionary lookups (issue 14256) Note that issue 14256 was specifically about inline images, please refer to: - https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G7.1852045 - https://www.pdfa.org/safedocs-unearths-pdf-inline-image-issue/ - https://pdf-issues.pdfa.org/32000-2-2020/clause08.html#H8.9.7 However, during review of the initial PR in https://github.com/mozilla/pdf.js/pull/14257#issuecomment-964469710, it was suggested that we instead do this unconditionally for all dictionary lookups. In addition to re-ordering the existing call-sites in the `src/core`-code, and adding non-PRODUCTION/TESTING asserts to catch future errors, for consistency a number of existing `if`/`switch`-blocks were re-factored to also check the abbreviated keys first.	2021-11-10 11:56:18 +01:00
Brendan Dahl	8161d3f29d	Don't double apply a group xobject's bbox. In `beginGroup` we create a new canvas that is the size of the bounding box and we translate it to the offset. This means we don't need to also apply the bounding box during `paintFormXObjectBegin`. This improves #6961 quite a bit, but it still is missing the indention in the ruler.	2021-11-05 15:40:58 -07:00
calixteman	bbb64369f1	Merge pull request #13424 from calixteman/chunks2 [api-minor] Fix issues in text selection	2021-10-18 06:14:15 -07:00
Calixte Denizet	61d1063276	Fix issues in text selection - PR #13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).	2021-10-17 16:27:05 +02:00
Jonas Jenwald	69a97bcba7	Take the /CIDToGIDMap data into account when computing the hash, in `PartialEvaluator.preEvaluateFont`, for composite fonts (bug 1734802) This is unfortunately yet another bug in the `preEvaluateFont`-implementation, and I've lost count of the number of times I've had to tweak this code over the years :-( I really cannot help thinking that PR 4423 was way too simplistic, since it missed a bunch of cases that leads to broken font rendering in many PDF documents. Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1734802	2021-10-08 13:15:21 +02:00
Jonas Jenwald	9acfe486d4	Fallback to font name matching, when checking for serif fonts (issue 13845) In order to handle fonts that specify completely bogus /Flags-entries, fallback to font name matching to determine if the font is a serif one.	2021-09-23 01:11:57 +02:00
Jonas Jenwald	ed73cf6d50	Support cmaps with only CID characters, when building the ToUnicode-map (issue 9367) In this particular case the `CMap`-data that we create contains only numbers, but no strings, which causes `PartialEvaluator.readToUnicode` to create a ToUnicode-map with only empty strings. Please note: This is yet another case where I don't know if it's necessarily the best and most correct solution, but it does fix the referenced issue.	2021-09-18 00:26:15 +02:00
Tim van der Meij	e97f01b17c	Merge pull request #13977 from Snuffleupagus/enqueueChunk-batch [api-minor] Reduce `postMessage` overhead, in `PartialEvaluator.getTextContent`, by sending text chunks in batches (issue 13962)	2021-09-11 13:34:07 +02:00
Brendan Dahl	f38fb42b42	Enable/disable image smoothing based on image interpolate value. (bug 1722191) While some of the output looks worse to my eye, this behavior more closely matches what I see when I open the PDFs in Adobe acrobat. Fixes: #4706, #9713, #8245, #1344	2021-09-10 14:23:35 -07:00
Jonas Jenwald	f90f9466e3	[api-minor] Reduce `postMessage` overhead, in `PartialEvaluator.getTextContent`, by sending text chunks in batches (issue 13962) Following the STR in the issue, this patch reduces the number of `PartialEvaluator.getTextContent`-related `postMessage`-calls by approximately 78 percent.[1] Note that by enforcing a relatively low value when batching text chunks, we should thus improve worst-case scenarios while not negatively affect all `textLayer` building. While working on these changes I noticed, thanks to our unit-tests, that the implementation of the `appendEOL` function unfortunately means that the number and content of the textItems could actually be affected by the particular chunking used. That seems extremely unfortunate, since in practice this means that the particular chunking used is thus observable through the API. Obviously that should be a completely internal implementation detail, which is why this patch also modifies `appendEOL` to mitigate that.[2] Given that this patch adds a minimum batch size in `enqueueChunk`, there's obviously nothing preventing it from becoming a lot larger then the limit (depending e.g. on the PDF structure and the CPU load/speed). While sending more text chunks at once isn't an issue in itself, it could become problematic at the main-thread during `textLayer` building. Note how both the `PartialEvaluator` and `CanvasGraphics` implementations utilize `Date.now()`-checks, to prevent long-running parsing/rendering from "hanging" the respective thread. In the `textLayer` building we don't utilize such a construction[3], and streaming of textContent is thus essentially acting as a simple stand-in for that functionality. Hence why we want to avoid choosing a too large minimum batch size, since that could thus indirectly affect main-thread performance negatively. --- [1] While it'd be possible to go even lower, that'd likely require more invasive re-factoring/changes to the `PartialEvaluator.getTextContent`-code to ensure that the batches don't become too large. [2] This should also, as far as I can tell, explain some of the regressions observed in the "enhance" text-selection tests back in PR 13257. Looking closer at the `appendEOL` function it should potentially be changed even more, however that should probably not be done here. [3] I'd really like to avoid implementing something like that for the `textLayer` building as well, given that it'd require adding a fair bit of complexity.	2021-09-09 00:01:07 +02:00
Jonas Jenwald	b34d2cdc42	Ensure that beginMarkedContentProps/endMarkedContent-operators, for /XObjects, are balanced in corrupt documents (PR 13854 follow-up) Something that I just realized is that while PR 13854 fixed an issue as reported, it could still cause bugs in other similarily broken documents since we'll not insert a matching endMarkedContent-operator in the operatorList.	2021-08-26 17:05:30 +02:00
Jonas Jenwald	853b1172a1	Support Optional Content in Image-/XObjects (issue 13931) Currently, in the `PartialEvaluator`, we only support Optional Content in Form-/XObjects. Hence this patch adds support for Image-/XObjects as well, which looks like a simple oversight in PR 12095 since the canvas-implementation already contains the necessary code to support this.	2021-08-26 16:54:15 +02:00
Jonas Jenwald	5f25fea0fe	Re-factor the `LocalTilingPatternCache` to cache by Ref rather than Name (PR 12458 follow-up, issue 13780) This way there cannot be any incorrect cache hits, since Refs are guaranteed to be unique. Please note that the reason for caching by Ref rather than doing something along the lines of the `localShadingPatternCache` (which uses a `Map` directly), is that TilingPatterns are streams and those cannot be cached on the `XRef`-instance (this way we avoid unnecessary parsing).	2021-08-18 12:49:01 +02:00
Brendan Dahl	4ad5c5d52a	Merge pull request #13808 from brendandahl/pattern-cache-v2 Improve caching of shading patterns. (bug 1721949)	2021-07-28 11:17:16 -07:00
Brendan Dahl	c836e1f0fb	Improve caching of shading patterns. (bug 1721949) The PDF in bug 1721949 uses many unique pattern objects that references the same shading many times. This caused a new canvas pattern to be created and cached many times driving up memory use. To fix, I've changed the cache in the worker to key off the shading object and instead send the shading and matrix separately. While that worked well to fix the above bug, there could be PDFs that use many shading that could cause memory issues, so I've also added a LRU cache on the main thread for canvas patterns. This should prevent memory use from getting too high.	2021-07-28 10:29:20 -07:00
Calixte Denizet	4a4591bd2c	XFA - Fix font scale factors (bug 1720888) - All the scale factors in for the substitution font were wrong because of different glyph positions between Liberation and the other ones: - regenerate all the factors - Text may have polish chars for example and in this case the glyph widths were wrong: - treat substitution font as a composite one - add a map glyphIndex to unicode for Liberation in order to generate width array for cid font	2021-07-28 19:10:42 +02:00
Calixte Denizet	76d882b560	XFA - Fix auto-sized fields (bug 1722030) - In order to better compute text fields size, use line height with no gaps (and consequently guessed height for text are slightly better in general). - Fix default background color in fields.	2021-07-28 09:43:15 +02:00
Brendan Dahl	da1af02ac8	Improve performance of reused patterns. Bug 1721218 has a shading pattern that was used thousands of times. To improve performance of this PDF: - add a cache for patterns in the evaluator and only send the IR form once to the main thread (this also makes caching in canvas easier) - cache the created canvas radial/axial patterns - for shading fill radial/axial use the pattern directly instead of creating temporary canvas	2021-07-22 16:47:40 -07:00
Jonas Jenwald	3838c4e27c	Re-factor the handling of empty `Name`-instances (PR 13612 follow-up) When working on PR 13612, I mostly prioritized a simple solution that didn't require touching a lot of code. However, while working on PR 13735 I started to realize that the static `Name.empty` construction really wasn't a good idea. In particular, having a special `Name`-instance where the `name`-property isn't actually a String is confusing (to put it mildly) and can easily lead to issues elsewhere. The only reason for not simply allowing the `name`-property to be an empty string, in PR 13612, was to avoid having to touch a lot of existing code. However, it turns out that this is only limited to a few methods in the `PartialEvaluator` and a few of the `BaseLocalCache`-implementations, all of which can be easily re-factored to handle empty `Name`-instances. All-in-all, I think that this patch is even an overall improvement since we're now validating (what should always be) `Name`-data better in the `PartialEvaluator`. This is what I ought to have done from the start, sorry about the code churn here!	2021-07-15 12:00:42 +02:00
Calixte Denizet	58e1f51688	XFA - Fix text positions (bug 1718741) - font line height is taken into account by acrobat when it isn't with masterpdfeditor: I extracted a font from a pdf, modified some ascent/descent properties thanks to ttx and the reinjected the font in the pdf: only Acrobat is taken it into account. So in this patch, line heights for some substituted fonts are added. - it seems that Acrobat is using a line height of 1.2 when the line height in the font is not enough (it's the only way I found to fix correctly bug 1718741). - don't use flex in wrapper container (which was causing an horizontal overflow in the above bug). - consequently, the above fixes introduced a lot of small regressions, so in order to see real improvements on reftests, I fixed the regressions in this patch: - replace margin by padding in some case where padding is a part of a container dimensions; - remove some flex display: some containers are wrongly sized when rendered; - set letter-spacing to 0.01px: it helps to be sure that text is not broken because of not enough width in Firefox.	2021-07-09 18:11:12 +02:00
Jonas Jenwald	273d8cb746	Add non-PRODUCTION/TESTING overflow `assert`s to various string helper-functions (issue 6759)	2021-06-27 16:06:30 +02:00
Jonas Jenwald	c4334dcfe7	Allow using the standard font data for non-Type1 fonts (issue 13585, PR 12726 follow-up) Given that we're not imposing any font-type restrictions[1] in the non-/FontDescriptor case, it's not really clear to me why we'd actually need to do that in the general case. Please note that there's some expected movement, all of which should be improvements, in the `fips197.pdf` file with this patch. --- [1] With the exception of Type3-fonts, of course.	2021-06-20 11:13:49 +02:00
Jonas Jenwald	d9ed14a2f5	Set the default value of `useSystemFonts` correctly, depending on `disableFontFace`, in the API (PR 13516 follow-up) Sorry about the churn here, since the change that I made in PR 13516 was not very smart. With the current code, it's now impossible for a user to actually control the `useSystemFonts` option manually. To prevent outright breakage we obviously still need to default to setting `useSystemFonts = false` when `disableFontFace === true`, however that should be possible for an API consumer to override.	2021-06-19 13:53:13 +02:00
Jonas Jenwald	229a49b9b9	Re-factor the `fallbackToUnicode` functionality (PR 9192 follow-up) Rather than having to create and check a separate `ToUnicodeMap` to handle these cases, we can simply use the `fallbackToUnicode`-data (when it exists) to directly supplement missing /ToUnicode entires in the regular `ToUnicodeMap` instead.	2021-06-14 15:05:14 +02:00
Jonas Jenwald	edc38de37a	Convert `PartialEvaluator.buildToUnicode` to an `async` method This removes the need to manually wrap all return values in a Promise.	2021-06-14 15:05:14 +02:00
Jonas Jenwald	69477bfb06	Always use standard font data, with `disableFontFace` set in the API (PR 12726 follow-up) We must force-fetch standard font data, when `disableFontFace = true` is set in the API, since otherwise rendering in e.g. the viewer is still broken (same as before PR 12726 landed). Please note: We still need to also load standard font data for patterns and/or some text-rendering modes, however that will require larger changes so I figured that it cannot hurt to submit this patch right now.	2021-06-09 21:21:02 +02:00
Jonas Jenwald	a01c599247	Cache the "raw" standard font data in the worker-thread (PR 12726 follow-up) This implementation is basically a copy of the pre-existing `builtInCMapCache` implementation. For some, badly generated, PDF documents it's possible that we'll end up having to fetch the same standard font data over and over (which is obviously inefficient). While not common, it's certainly possible that a PDF document uses custom font names where the actual font then references one of the standard fonts; see e.g. issue 11399 for one such example. Note that I did suggest adding worker-thread caching of standard font data in PR 12726, however it wasn't deemed necessary at the time. Now that we have a real-world example that benefit from caching, I think that we should simply implement this now.	2021-06-09 18:27:51 +02:00

1 2 3 4 5 ...