Sakurai/pdf.js - pdf.js - Gitea on kemo

Author	SHA1	Message	Date
calixteman	bbb64369f1	Merge pull request #13424 from calixteman/chunks2 [api-minor] Fix issues in text selection	2021-10-18 06:14:15 -07:00
Calixte Denizet	61d1063276	Fix issues in text selection - PR #13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare the position of the new glyph with the last position where a glyph has been drawn; - no spaces are "drawn": they just move the cursor but aren't added to the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to distinguish real spaces from tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where the space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).	2021-10-17 16:27:05 +02:00
Jonas Jenwald	69a97bcba7	Take the /CIDToGIDMap data into account when computing the hash, in `PartialEvaluator.preEvaluateFont`, for composite fonts (bug 1734802) This is unfortunately yet another bug in the `preEvaluateFont`-implementation, and I've lost count of the number of times I've had to tweak this code over the years :-( I really cannot help thinking that PR 4423 was way too simplistic, since it missed a bunch of cases that lead to broken font rendering in many PDF documents. Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1734802	2021-10-08 13:15:21 +02:00
Jonas Jenwald	9acfe486d4	Fallback to font name matching, when checking for serif fonts (issue 13845) In order to handle fonts that specify completely bogus /Flags-entries, fallback to font name matching to determine if the font is a serif one.	2021-09-23 01:11:57 +02:00
Jonas Jenwald	ed73cf6d50	Support cmaps with only CID characters, when building the ToUnicode-map (issue 9367) In this particular case the `CMap`-data that we create contains only numbers, but no strings, which causes `PartialEvaluator.readToUnicode` to create a ToUnicode-map with only empty strings. Please note: This is yet another case where I don't know if it's necessarily the best and most correct solution, but it does fix the referenced issue.	2021-09-18 00:26:15 +02:00
Tim van der Meij	e97f01b17c	Merge pull request #13977 from Snuffleupagus/enqueueChunk-batch [api-minor] Reduce `postMessage` overhead, in `PartialEvaluator.getTextContent`, by sending text chunks in batches (issue 13962)	2021-09-11 13:34:07 +02:00
Brendan Dahl	f38fb42b42	Enable/disable image smoothing based on image interpolate value. (bug 1722191) While some of the output looks worse to my eye, this behavior more closely matches what I see when I open the PDFs in Adobe Acrobat. Fixes: #4706, #9713, #8245, #1344	2021-09-10 14:23:35 -07:00
Jonas Jenwald	f90f9466e3	[api-minor] Reduce `postMessage` overhead, in `PartialEvaluator.getTextContent`, by sending text chunks in batches (issue 13962) Following the STR in the issue, this patch reduces the number of `PartialEvaluator.getTextContent`-related `postMessage`-calls by approximately 78 percent.[1] Note that by enforcing a relatively low value when batching text chunks, we should thus improve worst-case scenarios while not negatively affecting all `textLayer` building. While working on these changes I noticed, thanks to our unit-tests, that the implementation of the `appendEOL` function unfortunately means that the number and content of the textItems could actually be affected by the particular chunking used. That seems extremely unfortunate, since in practice this means that the particular chunking used is thus observable through the API. Obviously that should be a completely internal implementation detail, which is why this patch also modifies `appendEOL` to mitigate that.[2] Given that this patch adds a minimum batch size in `enqueueChunk`, there's obviously nothing preventing it from becoming a lot larger than the limit (depending e.g. on the PDF structure and the CPU load/speed). While sending more text chunks at once isn't an issue in itself, it could become problematic at the main-thread during `textLayer` building. Note how both the `PartialEvaluator` and `CanvasGraphics` implementations utilize `Date.now()`-checks, to prevent long-running parsing/rendering from "hanging" the respective thread. In the `textLayer` building we don't utilize such a construction[3], and streaming of textContent is thus essentially acting as a simple stand-in for that functionality. Hence why we want to avoid choosing a too large minimum batch size, since that could thus indirectly affect main-thread performance negatively.
--- [1] While it'd be possible to go even lower, that'd likely require more invasive re-factoring/changes to the `PartialEvaluator.getTextContent`-code to ensure that the batches don't become too large. [2] This should also, as far as I can tell, explain some of the regressions observed in the "enhance" text-selection tests back in PR 13257. Looking closer at the `appendEOL` function it should potentially be changed even more, however that should probably not be done here. [3] I'd really like to avoid implementing something like that for the `textLayer` building as well, given that it'd require adding a fair bit of complexity.	2021-09-09 00:01:07 +02:00
Jonas Jenwald	b34d2cdc42	Ensure that beginMarkedContentProps/endMarkedContent-operators, for /XObjects, are balanced in corrupt documents (PR 13854 follow-up) Something that I just realized is that while PR 13854 fixed an issue as reported, it could still cause bugs in other similarly broken documents since we'll not insert a matching endMarkedContent-operator in the operatorList.	2021-08-26 17:05:30 +02:00
Jonas Jenwald	853b1172a1	Support Optional Content in Image-/XObjects (issue 13931) Currently, in the `PartialEvaluator`, we only support Optional Content in Form-/XObjects. Hence this patch adds support for Image-/XObjects as well, which looks like a simple oversight in PR 12095 since the canvas-implementation already contains the necessary code to support this.	2021-08-26 16:54:15 +02:00
Jonas Jenwald	5f25fea0fe	Re-factor the `LocalTilingPatternCache` to cache by Ref rather than Name (PR 12458 follow-up, issue 13780) This way there cannot be any incorrect cache hits, since Refs are guaranteed to be unique. Please note that the reason for caching by Ref rather than doing something along the lines of the `localShadingPatternCache` (which uses a `Map` directly), is that TilingPatterns are streams and those cannot be cached on the `XRef`-instance (this way we avoid unnecessary parsing).	2021-08-18 12:49:01 +02:00
Brendan Dahl	4ad5c5d52a	Merge pull request #13808 from brendandahl/pattern-cache-v2 Improve caching of shading patterns. (bug 1721949)	2021-07-28 11:17:16 -07:00
Brendan Dahl	c836e1f0fb	Improve caching of shading patterns. (bug 1721949) The PDF in bug 1721949 uses many unique pattern objects that reference the same shading many times. This caused a new canvas pattern to be created and cached many times, driving up memory use. To fix, I've changed the cache in the worker to key off the shading object and instead send the shading and matrix separately. While that worked well to fix the above bug, there could be PDFs that use many shadings that could cause memory issues, so I've also added an LRU cache on the main thread for canvas patterns. This should prevent memory use from getting too high.	2021-07-28 10:29:20 -07:00
Calixte Denizet	4a4591bd2c	XFA - Fix font scale factors (bug 1720888) - All the scale factors for the substitution font were wrong because of different glyph positions between Liberation and the other ones: - regenerate all the factors - Text may have Polish chars for example and in this case the glyph widths were wrong: - treat substitution font as a composite one - add a map glyphIndex to unicode for Liberation in order to generate width array for cid font	2021-07-28 19:10:42 +02:00
Calixte Denizet	76d882b560	XFA - Fix auto-sized fields (bug 1722030) - In order to better compute text fields size, use line height with no gaps (and consequently guessed height for text are slightly better in general). - Fix default background color in fields.	2021-07-28 09:43:15 +02:00
Brendan Dahl	da1af02ac8	Improve performance of reused patterns. Bug 1721218 has a shading pattern that was used thousands of times. To improve performance of this PDF: - add a cache for patterns in the evaluator and only send the IR form once to the main thread (this also makes caching in canvas easier) - cache the created canvas radial/axial patterns - for shading fill radial/axial use the pattern directly instead of creating temporary canvas	2021-07-22 16:47:40 -07:00
Jonas Jenwald	3838c4e27c	Re-factor the handling of empty `Name`-instances (PR 13612 follow-up) When working on PR 13612, I mostly prioritized a simple solution that didn't require touching a lot of code. However, while working on PR 13735 I started to realize that the static `Name.empty` construction really wasn't a good idea. In particular, having a special `Name`-instance where the `name`-property isn't actually a String is confusing (to put it mildly) and can easily lead to issues elsewhere. The only reason for not simply allowing the `name`-property to be an empty string, in PR 13612, was to avoid having to touch a lot of existing code. However, it turns out that this is only limited to a few methods in the `PartialEvaluator` and a few of the `BaseLocalCache`-implementations, all of which can be easily re-factored to handle empty `Name`-instances. All-in-all, I think that this patch is even an overall improvement since we're now validating (what should always be) `Name`-data better in the `PartialEvaluator`. This is what I ought to have done from the start, sorry about the code churn here!	2021-07-15 12:00:42 +02:00
Calixte Denizet	58e1f51688	XFA - Fix text positions (bug 1718741) - font line height is taken into account by Acrobat but not by masterpdfeditor: I extracted a font from a pdf, modified some ascent/descent properties thanks to ttx and then reinjected the font in the pdf: only Acrobat takes it into account. So in this patch, line heights for some substituted fonts are added. - it seems that Acrobat is using a line height of 1.2 when the line height in the font is not enough (it's the only way I found to fix correctly bug 1718741). - don't use flex in wrapper container (which was causing a horizontal overflow in the above bug). - consequently, the above fixes introduced a lot of small regressions, so in order to see real improvements on reftests, I fixed the regressions in this patch: - replace margin by padding in some cases where padding is a part of a container's dimensions; - remove some flex display: some containers are wrongly sized when rendered; - set letter-spacing to 0.01px: it helps to be sure that text is not broken because of not enough width in Firefox.	2021-07-09 18:11:12 +02:00
Jonas Jenwald	273d8cb746	Add non-PRODUCTION/TESTING overflow `assert`s to various string helper-functions (issue 6759)	2021-06-27 16:06:30 +02:00
Jonas Jenwald	c4334dcfe7	Allow using the standard font data for non-Type1 fonts (issue 13585, PR 12726 follow-up) Given that we're not imposing any font-type restrictions[1] in the non-/FontDescriptor case, it's not really clear to me why we'd actually need to do that in the general case. Please note that there's some expected movement, all of which should be improvements, in the `fips197.pdf` file with this patch. --- [1] With the exception of Type3-fonts, of course.	2021-06-20 11:13:49 +02:00
Jonas Jenwald	d9ed14a2f5	Set the default value of `useSystemFonts` correctly, depending on `disableFontFace`, in the API (PR 13516 follow-up) Sorry about the churn here, since the change that I made in PR 13516 was not very smart. With the current code, it's now impossible for a user to actually control the `useSystemFonts` option manually. To prevent outright breakage we obviously still need to default to setting `useSystemFonts = false` when `disableFontFace === true`, however that should be possible for an API consumer to override.	2021-06-19 13:53:13 +02:00
Jonas Jenwald	229a49b9b9	Re-factor the `fallbackToUnicode` functionality (PR 9192 follow-up) Rather than having to create and check a separate `ToUnicodeMap` to handle these cases, we can simply use the `fallbackToUnicode`-data (when it exists) to directly supplement missing /ToUnicode entries in the regular `ToUnicodeMap` instead.	2021-06-14 15:05:14 +02:00
Jonas Jenwald	edc38de37a	Convert `PartialEvaluator.buildToUnicode` to an `async` method This removes the need to manually wrap all return values in a Promise.	2021-06-14 15:05:14 +02:00
Jonas Jenwald	69477bfb06	Always use standard font data, with `disableFontFace` set in the API (PR 12726 follow-up) We must force-fetch standard font data, when `disableFontFace = true` is set in the API, since otherwise rendering in e.g. the viewer is still broken (same as before PR 12726 landed). Please note: We still need to also load standard font data for patterns and/or some text-rendering modes, however that will require larger changes so I figured that it cannot hurt to submit this patch right now.	2021-06-09 21:21:02 +02:00
Jonas Jenwald	a01c599247	Cache the "raw" standard font data in the worker-thread (PR 12726 follow-up) This implementation is basically a copy of the pre-existing `builtInCMapCache` implementation. For some, badly generated, PDF documents it's possible that we'll end up having to fetch the same standard font data over and over (which is obviously inefficient). While not common, it's certainly possible that a PDF document uses custom font names where the actual font then references one of the standard fonts; see e.g. issue 11399 for one such example. Note that I did suggest adding worker-thread caching of standard font data in PR 12726, however it wasn't deemed necessary at the time. Now that we have a real-world example that benefits from caching, I think that we should simply implement this now.	2021-06-09 18:27:51 +02:00
Calixte Denizet	34a2fa72c7	XFA - Add Liberation-Sans font as a substitution for some missing fonts - Some js files contain scale factors for each glyph in order to rescale Liberation to have a final font with the correct width. - A lot of XFA have some containers where their dimensions are based on their text content, so using default font from browser can lead to an almost unreadable pdf.	2021-06-09 16:55:45 +02:00
Jonas Jenwald	d995f90183	Fetch binary CMap data in the worker-thread, when `useWorkerFetch` is set This patch uses the new option added in PR 12726 to also allow fetching binary CMap data directly in the worker-thread in browsers. Given that these changes remove the need to transfer data between threads for the default (browser) use-case, we can also revert the changes in PR 11118 since that simplifies the overall implementation.	2021-06-08 21:51:07 +02:00
Jonas Jenwald	e7dc822e74	Merge pull request #12726 from brendandahl/standard-fonts [api-minor] Include and use the 14 standard font files.	2021-06-08 10:09:40 +02:00
Brendan Dahl	4c1dd47e65	Include and use the 14 standard fonts files.	2021-06-07 11:10:11 -07:00
Jonas Jenwald	eefc94ceb7	Ensure that we fully load Type3 fonts in `PartialEvaluator.getTextContent` This is necessary now, since with the previous patch the /FontBBox potentially depends on the contents of the /CharProcs-streams. Note that if `getOperatorList` is called before `getTextContent`, this patch doesn't matter since the font is already fully loaded/parsed. However, for e.g. the `text` test-cases this is necessary to ensure correct reference images.	2021-06-05 08:09:29 +02:00
Jonas Jenwald	20770cb06a	Improve text-selection for Type3 fonts with empty /FontBBox-entries (issue 6605) For Type3 fonts where the /CharProcs-streams of the individual glyph starts with a `d1` operator, we can use that to build a fallback bounding box for the font and thus improve text-selection in some cases.	2021-06-05 08:09:29 +02:00
Jonas Jenwald	e3bde56311	Ensure that the old/new `options` are correctly combined in `PartialEvaluator.clone`	2021-05-31 12:14:53 +02:00
Jonas Jenwald	c4429bc3f2	Do the `isType3Font`-check once, rather than repeating it, in `PartialEvaluator.translateFont` This is a small piece of clean-up that I happened to notice while browsing the code.	2021-05-22 11:46:37 +02:00
Jonas Jenwald	68350378c0	Handle errors gracefully, in `PartialEvaluator.buildFontPaths`, when glyph path building fails The building of glyph paths, in the `FontRendererFactory`, can fail in various ways for corrupt font data. However, we're currently not attempting to handle any such errors in the evaluator, which means that a single broken glyph can prevent an entire page from rendering. To address this we simply have to pass along, and check, the existing `ignoreErrors` option in `PartialEvaluator.buildFontPaths` similar to the rest of the `PartialEvaluator` code.	2021-05-22 11:46:31 +02:00
Jonas Jenwald	718f7bf7e1	Fix a few safe ESLint `no-var` failures in `src/core/evaluator.js` (13371 follow-up) As can be seen in PR 13371, some of the `no-var` changes in the `PartialEvaluator.{getOperatorList, getTextContent}` methods caused errors in `gulp server`-mode. However, there's a handful of instances of `var` in other methods which should be completely safe to convert since there's no strange scope-issues present in that code.	2021-05-16 15:22:43 +02:00
Jonas Jenwald	8943bcd3c3	Account for formatting changes in Prettier version `2.3.0` With the exception of one tweaked `eslint-disable` comment, in `web/generic_scripting.js`, this patch was generated automatically using `gulp lint --fix`. Please find additional information at: - https://github.com/prettier/prettier/releases/tag/2.3.0 - https://prettier.io/blog/2021/05/09/2.3.0.html	2021-05-16 11:44:05 +02:00
Jonas Jenwald	75208d36c2	Revert "Fix the remaining `no-var` failures, which couldn't be handled automatically, in the `src/core/evaluator.js` file" (PR 13344 follow-up) This reverts commit 0ef9b5aafc88094f19fec793c174c622e7e15542, since it causes a lot of warnings (see below) locally with e.g. the document from issue 9627. Strangely enough, this only occurs with `gulp server`-mode and the actual builds are apparently fine. It seems that this may be some unfortunate interaction with the old Babel-plugin that's used together with SystemJS. ``` Warning: getTextContent - ignoring ExtGState: "FormatError: ExtGState should be a dictionary.". ``` Rather than taking the risk that this could actually cover a more serious bug, and since I cannot immediately figure out what's wrong, it thus seems safest to revert this for now and we can (carefully) revisit this once SystemJS has been removed (see PR 12563).	2021-05-13 11:19:46 +02:00
Jonas Jenwald	6eef69de22	Export the "raw" `toUnicode`-data from `PartialEvaluator.preEvaluateFont` Compared to other data-structures, such as e.g. `Dict`s, we're purposely not caching Streams on the `XRef`-instance.[1] The, somewhat unfortunate, effect of Streams not being cached is that repeatedly getting the same Stream-data requires re-parsing/re-initializing of a bunch of data; see `XRef.fetch` and related methods. For the font-parsing in particular we're currently fetching the `toUnicode`-data, which is very often a Stream, in `PartialEvaluator.preEvaluateFont` and then again in `PartialEvaluator.extractDataStructures` soon afterwards. By instead letting `PartialEvaluator.preEvaluateFont` export the "raw" `toUnicode`-data, we can avoid some unnecessary re-parsing/re-initializing when handling fonts. Please note: In this particular case, given that `PartialEvaluator.preEvaluateFont` only accesses the "raw" `toUnicode` data, exporting a Stream should be safe. --- [1] The reasons for this include: - Streams, especially `DecodeStream`-instances, can become very large once read. Hence caching them really isn't a good idea simply because of the (potential) memory impact of doing so. - Attempting to read from the same Stream-instance more than once won't work, unless it's `reset` in between, since using any method such as e.g. `getBytes` always starts at the current data position. - Given that parsing, even in the worker-thread, is now fairly asynchronous it's generally impossible to assert that any one Stream-instance isn't being accessed "concurrently" by e.g. different `getOperatorList` calls. Hence `reset`-ing a cached Stream-instance isn't going to work in the general case.	2021-05-08 12:04:13 +02:00
Jonas Jenwald	13fb1654dc	Export the `firstChar`/`lastChar`-data from `PartialEvaluator.preEvaluateFont` Rather than re-fetching/re-parsing these properties immediately in `PartialEvaluator.translateFont`, we can simply export them instead. (Obviously the effect will be really tiny, but there is less parsing overall this way.)	2021-05-08 12:02:49 +02:00
Jonas Jenwald	8a1cb82aee	Ensure that the `Widths` array is parsed correctly in `PartialEvaluator.preEvaluateFont` Please note: While I don't have a document that this patch fixes, the current code is however not entirely correct as far as I can tell. Looking at how the `Widths` array is parsed in `PartialEvaluator.extractWidths`, it's clear that the implementation in `PartialEvaluator.preEvaluateFont` is a bit too simplistic. In particular, by only wrapping the data into a TypedArray, there's no attempt to handle indirect objects which could potentially lead to colliding `hash`es being computed.	2021-05-07 21:23:44 +02:00
Jonas Jenwald	30b2739adf	Ensure that composite/non-composite fonts won't get the same `hash` in `PartialEvaluator.preEvaluateFont` To hopefully help prevent any future bugs, make sure that composite/non-composite fonts cannot accidentally get matching `hash`es. Given the differences between those font types, that's very unlikely to be useful or even correct in general.	2021-05-07 21:22:37 +02:00
Jonas Jenwald	fc59a5f709	Take the `W` array into account when computing the hash, in `PartialEvaluator.preEvaluateFont`, for composite fonts (issue 13343) Without this some composite fonts may incorrectly end up with matching `hash`es, thus breaking rendering since we'll not actually try to load/parse some of the fonts. Please note: Given that the document, in the referenced issue, doesn't embed any of its fonts there's no guarantee that it renders correctly in all configurations even with this patch.	2021-05-07 21:22:36 +02:00
Jonas Jenwald	0ef9b5aafc	Fix the remaining `no-var` failures, which couldn't be handled automatically, in the `src/core/evaluator.js` file The only slight complication here were some of the `switch`-cases, in `getOperatorList`/`getTextContent`, where the parsing is done asynchronously. However, those cases are easy to deal with by wrapping the code within its own block; please see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/switch#block-scope_variables_within_switch_statements	2021-05-06 10:21:05 +02:00
Jonas Jenwald	f93c3b9aa7	Enable the `no-var` rule in the `src/core/evaluator.js` file These changes were made automatically, using `gulp lint --fix`.	2021-05-06 09:39:21 +02:00
Jonas Jenwald	77b258440b	Move some constants and helper functions from `src/core/fonts.js` and into their own file - `FontFlags`, is used in both `src/core/fonts.js` and `src/core/evaluator.js`. - `getFontType`, same as the above. - `MacStandardGlyphOrdering`, is a fairly large data-structure and `src/core/fonts.js` is already a very large file. - `recoverGlyphName`, a dependency of `type1FontGlyphMapping`; please see below. - `SEAC_ANALYSIS_ENABLED`, is used by both `Type1Font`, `CFFFont`, and unit-tests; please see below. - `type1FontGlyphMapping`, is used by both `Type1Font` and `CFFFont` which a later patch will move to their own files.	2021-05-02 21:00:29 +02:00
Jonas Jenwald	6912bb5e0a	Move the `IdentityToUnicodeMap`/`ToUnicodeMap` from `src/core/fonts.js` and into its own file	2021-05-02 21:00:29 +02:00
Tim van der Meij	f6f335173d	Merge pull request #13303 from Snuffleupagus/BaseStream Add an abstract base-class, which all the various Stream implementations inherit from	2021-05-01 19:13:36 +02:00
calixteman	af4dc55019	[api-minor] Fix the way to chunk the strings (#13257) - Improve chunking in order to fix some bugs where the spaces are missing: * track the last position where a glyph has been drawn; * when a new glyph (first glyph in a chunk) is added then compare its position with the last saved one and add a space or break: - there are multiple ways to move the glyphs and to avoid having to deal with all the different possibilities it's way easier to just compare positions; - and so there is now one function (i.e. "compareWithLastPosition") where all the job is done. - Add some breaks in order to get lines; - Remove the multiple white spaces: * some spaces were filled with several white spaces and so it makes it harder to find some sequences of words using the search tool; * other pdf readers replace spaces by one white space. Update src/core/evaluator.js Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>	2021-04-30 14:41:13 +02:00
Jonas Jenwald	30a22a168d	Move the `DecodeStream` and `StreamsSequenceStream` from `src/core/stream.js` and into its own file	2021-04-28 10:16:51 +02:00
Jonas Jenwald	da22146b95	Replace a bunch of `Array.prototype.forEach()` cases with `for...of` loops instead Using `for...of` is a modern and generally much nicer pattern, since it gets rid of unnecessary callback-functions. (In a couple of spots, a "regular" `for` loop had to be used.)	2021-04-24 13:00:19 +02:00
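
The position-comparison idea described in commits af4dc55019 and 61d1063276 above can be illustrated with a minimal sketch. This is not the actual pdf.js code: the glyph shape (`char`, `x`, `width`) and the names `isTrackingGap` and `buildChunk` are hypothetical, and only the `0.1 * fontSize` threshold comes from the commit message. Line breaks and backward cursor moves are ignored here.

```javascript
// Hypothetical sketch; not the pdf.js implementation.
// The 0.1 factor is the Poppler-style threshold quoted in commit 61d1063276.
const TRACKING_SPACE_FACTOR = 0.1;

// A gap no wider than 0.1 * fontSize is treated as tracking (letter spacing),
// not as a real space between words.
function isTrackingGap(gap, fontSize) {
  return gap <= TRACKING_SPACE_FACTOR * fontSize;
}

// glyphs: [{ char, x, width }] in drawing order on a single line (assumed shape).
function buildChunk(glyphs, fontSize) {
  let text = "";
  let lastEnd = null; // x position where the last drawn glyph ended
  for (const { char, x, width } of glyphs) {
    if (char === " ") {
      // Spaces are not "drawn": they only move the cursor, so they are skipped
      // and the position comparison below decides whether to emit one space.
      continue;
    }
    if (lastEnd !== null && !isTrackingGap(x - lastEnd, fontSize)) {
      text += " "; // at most one space, however many spaces or moves occurred
    }
    text += char;
    lastEnd = x + width;
  }
  return text;
}
```

With a font size of 10, a gap of 0.5 between two glyphs is treated as tracking and produces no space, while two consecutive space glyphs followed by a glyph 5.5 units past the last drawn one collapse into a single space.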
