pdf.js

Author	SHA1	Message	Date
Jonas Jenwald	2ff9799e7a	Tweak assignment of common parameters in the `Annotation` classes This is slightly more compact, and also unifies the format across the various classes.	2022-11-20 12:29:59 +01:00
Jonas Jenwald	c92de947b6	Reduce duplication when creating a fallback appearance for `MarkupAnnotation`s Currently we repeat the same color-conversion code verbatim in lots of classes, which seems completely unnecessary.	2022-11-20 12:05:25 +01:00
Tim van der Meij	d6908ee145	Merge pull request #15701 from Snuffleupagus/move-string-helpers Move some string helper functions to the worker-thread	2022-11-19 11:20:07 +01:00
Jonas Jenwald	70d362f22c	Remove an unnecessary variable in `getPdfManager`, in the `src/core/worker.js` file Another tiny piece of clean-up, since adding a `catch`-handler to a Promise shouldn't require an intermediate variable.	2022-11-17 15:31:41 +01:00
Jonas Jenwald	a2a200175f	Remove unnecessary function names in the `src/core/worker.js` file Currently some functions in this file have names while others don't, and in a few cases the names are no longer entirely accurate. For the relevant functions there should really be no need to name them, and if memory serves this was originally done since browsers (many years ago) didn't always handle anonymous functions correctly in stack traces.	2022-11-17 15:12:48 +01:00
Jonas Jenwald	9adc7859c8	Move the `escapeString` helper function into the worker-thread Given that this helper function is only used on the worker-thread, there's no reason to duplicate it in both of the `pdf.js` and `pdf.worker.js` files.	2022-11-16 12:35:48 +01:00
Jonas Jenwald	e5859e145d	Move the `isAscii` helper function into the worker-thread Given that this helper function is only used on the worker-thread, there's no reason to duplicate it in both of the `pdf.js` and `pdf.worker.js` files.	2022-11-16 12:35:48 +01:00
Jonas Jenwald	2eaa708e3a	Combine the `stringToUTF16String` and `stringToUTF16BEString` helper functions Given that these functions are virtually identical, with the latter only adding a BOM, we can combine the two. Furthermore, since both functions were only used on the worker-thread, there's no reason to duplicate this functionality in both of the `pdf.js` and `pdf.worker.js` files.	2022-11-16 12:35:44 +01:00
Jonas Jenwald	f358e76f5b	Move the `_isOffscreenCanvasSupported` property to the base `Annotation` class Having just played around with adding FreeText-annotations and then trying to print, there were `FreeTextAnnotation: OffscreenCanvas is not supported, annotation may not render correctly.` messages printed in the console. The reason for this is that `FreeTextAnnotation` inherits from `MarkupAnnotation`, however only `WidgetAnnotation` actually defines the `_isOffscreenCanvasSupported` property.	2022-11-15 16:30:53 +01:00
Jonas Jenwald	d22eb3591e	Change the `assert` in `Parser.findDefaultInlineStreamEnd` to a non-PRODUCTION one Given that this `assert` is only intended to catch any implementation bugs in our code, and not actually to validate the PDF data directly[1], we can avoid making this function call unconditionally. --- [1] In those cases, for example a `FormatError` should have been thrown instead.	2022-11-12 16:30:58 +01:00
Jonas Jenwald	595711bd7c	Merge pull request #15679 from Snuffleupagus/bug-1799927-2 Use the full inline image as the cacheKey in `Parser.makeInlineImage` (bug 1799927)	2022-11-10 22:54:48 +01:00
Calixte Denizet	3ca03603c2	[Annotation] Fix printing/saving for annotations containing some non-ascii chars and with no fonts to handle them (bug 1666824) - For text fields * when printing, we generate a fake font which contains some widths computed thanks to an OffscreenCanvas and its method measureText. To avoid having to lay out the glyphs ourselves, we just render all of them in one call in the showText method, using the system sans-serif/monospace fonts. * when saving, we continue to create the appearance streams if the fonts contain the char but when a char is missing, we just set, in the AcroForm dict, the flag /NeedAppearances to true and remove the appearance stream. This way, we let the different readers handle the rendering of the strings. - For FreeText annotations * when printing, we use the same trick as for text fields. * there is no need to save an appearance since Acrobat is able to infer one from the Content entry.	2022-11-10 19:05:39 +01:00
Jonas Jenwald	7abb6429b0	Initialize the dictionary lazily when parsing inline images This helps improve performance for some PDF documents with a huge number of inline images, e.g. the PDF document from issue 2618. Given that we no longer create `Stream`-instances unconditionally, we also don't need `Dict`-instances for cached inline images (since we only access the filter).	2022-11-10 18:27:26 +01:00
Jonas Jenwald	b46e0d61cf	Use the full inline image as the cacheKey in `Parser.makeInlineImage` (bug 1799927) Please note: This only fixes the "wrong letter" part of bug 1799927. It appears that the simple `computeAdler32` function, used when caching inline images, generates hash collisions for some (very short) TypedArrays. In this case that leads to some of the "letters", which are actually inline images, being rendered incorrectly. Rather than switching to another hashing algorithm, e.g. the `MurmurHash3_64` class, we simply cache using a stringified version of the inline image data as the cacheKey to prevent any future collisions. While this will (naturally) lead to slightly higher peak memory usage, it'll however be limited to the current `Parser`-instance which means that it's not persistent. One small benefit of these changes is that we can avoid creating lots of `Stream`-instances for already cached inline images.	2022-11-10 18:27:26 +01:00
Jonas Jenwald	f7449563ef	Merge pull request #15659 from sxyuan/system-font-name-fix [api-minor] Propagate the translated font name to TextContentItem for system fonts	2022-11-08 21:56:49 +01:00
Samuel Yuan	36fb5c1e2b	Propagate the translated font name to TextContentItems. This allows font data for system fonts to be looked up in the PDFObjects.	2022-11-08 11:16:21 -08:00
Jonas Jenwald	c8868a1c7a	[api-minor] Initialize the unicode-category lazily on the `Glyph`-instance The purpose of this patch is twofold: - Initialize the unicode-category data lazily during text-extraction, since this is completely unused during general parsing/rendering. - Stop exposing this data in the API, since it's unused on the main-thread and it seems like it was accidentally included. Obviously these changes are API-observable, but hopefully no user is depending on this. Furthermore, it's trivial for a user to re-create this unicode-category data manually with a regular expression (from the exposed `unicode` property).	2022-11-05 10:12:17 +01:00
Jonas Jenwald	c33b8d7692	Cache the normalized unicode-value on the `Glyph`-instance Currently, during text-extraction, we're repeatedly normalizing and (when necessary) reversing the unicode-values every time. This seems a little unnecessary, since the result won't change, hence this patch moves that into the `Glyph`-instance and makes it lazily initialized. Taking the `tracemonkey.pdf` document as an example: When extracting the text-content there's a total of 69236 characters but only 595 unique `Glyph`-instances, which means a 99.1 percent cache hit-rate. Generally speaking, the longer a PDF document is the more beneficial this should be. Please note: The old code is fast enough that it unfortunately seems difficult to measure a (clear) performance improvement with this patch, so I completely understand if it's deemed an unnecessary change.	2022-11-03 22:36:53 +01:00
Jonas Jenwald	23930a249e	[api-minor] Let `Catalog.getAllPageDicts` return an empty dictionary when loading the first /Page fails (issue 15590) In order to support opening certain corrupt PDF documents, particularly hand-edited ones, this patch adds support for letting the `Catalog.getAllPageDicts` method fallback to returning an empty dictionary to replace (only) the first /Page of the document. Given that the viewer cannot initialize/load without access to the first page, this will thus allow e.g. document-level scripting to run as expected. Note that by effectively replacing a corrupt or missing first /Page in this way[1], we'll now render nothing but a blank page for certain cases of broken/corrupt PDF documents which may look weird. Please note: This functionality is controlled via the existing `stopAtErrors` option, that can be passed to `getDocument`, since it's easy to imagine use-cases where this sort of fallback behaviour isn't desirable. --- [1] Currently we still require that a /Pages-dictionary is found though, however it may be possible to relax even that assumption if that becomes absolutely necessary in future corrupt documents.	2022-11-03 12:51:48 +01:00
Jonas Jenwald	2516ffa78e	Fallback to finding the first "obj" occurrence, when the trailer-dictionary is incomplete (issue 15590) Note that the "trailer"-case is already a fallback, since normally we're able to use the "xref"-operator even in corrupt documents. However, when a "trailer"-operator is found we still expect "startxref" to exist and be usable in order to advance the stream position. When that's not the case, as happens in the referenced issue, we use a simple fallback to find the first "obj" occurrence instead. This partially fixes issue 15590, since without this patch we fail to find any objects at all during `XRef.indexObjects`. However, note that the PDF document is still corrupt and won't render since there's no actual /Pages-dictionary and the /Root-entry simply points to the /OpenAction-dictionary instead.	2022-11-03 12:46:30 +01:00
calixteman	e42e1cde61	Merge pull request #15615 from calixteman/bug1796741 [Form] Don't use field appearances when /NeedAppearances is set to true (bug 1796741)	2022-10-31 09:58:27 +01:00
Jonas Jenwald	caef47a0cf	Remove the `PdfManager.onLoadedStream` method (PR 15616 follow-up) After the clean-up in PR 15616, the `PdfManager.onLoadedStream` method now only has a single call-site. Hence why this patch suggests that we remove this method and replace it with an optional parameter in `PdfManager.requestLoadedStream` instead. By making the new behaviour opt-in, we'll thus not change any existing call-site.	2022-10-29 14:42:17 +02:00
Jonas Jenwald	8b970109ea	Merge pull request #15632 from Snuffleupagus/issue-15629-2 [api-minor] Move the handling of unbalanced markedContent to the worker-thread (PR 15630 follow-up)	2022-10-29 09:37:07 +02:00
Jonas Jenwald	ba05e47b3e	Combine `Array.from` and `Array.prototype.map` calls This isn't just a tiny bit more compact, but it also avoids an intermediate allocation; please see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/from#description	2022-10-28 13:46:30 +02:00
Jonas Jenwald	1e7274e9c6	[api-minor] Move the handling of unbalanced markedContent to the worker-thread (PR 15630 follow-up)	2022-10-27 11:14:54 +02:00
Calixte Denizet	9f95a14e91	[Form] Don't use field appearances when /NeedAppearances is set to true (bug 1796741) When a form isn't changed, we used the appearances we had in the file, but when /NeedAppearances is true, all the appearances have to be regenerated, whatever they are.	2022-10-26 12:10:51 +02:00
Jonas Jenwald	bcffbf74f3	Let the `PdfManager.requestLoadedStream` method return the stream This is very old code, and it could thus do with some simplification. Note how in the `src/core/worker.js` file we're combining both the `PdfManager.requestLoadedStream` and `PdfManager.onLoadedStream` methods in order to access the stream-data. This seems unnecessary, and it's simple enough to always let the `PdfManager.requestLoadedStream` method return the stream-data as well.	2022-10-24 17:00:48 +02:00
Jonas Jenwald	71bd8b4de9	Let `Lexer.getNumber` treat more invalid "numbers" as zero (issue 15604) In the referenced PDF document there are "numbers" which consist only of `-.`, and while that's obviously not valid, Adobe Reader seems to handle it just fine. Letting this method ignore more invalid "numbers" was suggested during the review of PR 14543, so let's simply relax the validation here.	2022-10-20 22:36:15 +02:00
Jonas Jenwald	e591378ff1	Restore a weaker version of the /Pages dictionary /Count check for corrupt documents (PR 15593 follow-up) It appears that PR 15593 broke `issue12402`, and we thus need to partially restore the /Count check. I completely missed this when looking at the test-results for PR 15593, both locally and on the bots, since the `Driver._getLastPageNumber` method would "swallow" an unavailable page number.	2022-10-20 14:22:29 +02:00
Jonas Jenwald	36967fcedb	Merge pull request #15586 from Snuffleupagus/rm-matchesForCache Remove the `Glyph.matchesForCache` method (PR 13494 follow-up)	2022-10-20 10:35:00 +02:00
Jonas Jenwald	3c046c0a21	Extend `getSupplementalGlyphMapForCalibri` with some umlauts (issue 15594)	2022-10-19 17:49:40 +02:00
Jonas Jenwald	bc13a277ce	Relax the /Pages dictionary /Count check for corrupt documents (issue 9105) After PR 14311, and follow-up patches, we no longer require that the /Count entry (in the /Pages dictionary) is either present or even valid in order to parse/render a PDF document. Hence it seems strange to keep this requirement for corrupt PDF documents, when trying to find a usable `trailer` in the `XRef.indexObjects` method.	2022-10-19 12:28:25 +02:00
Jonas Jenwald	fd35cda8bc	Re-factor the glyph-cache lookup in the `Font._charToGlyph` method With the changes in the previous patch we can move the glyph-cache lookup to the top of the method and thus avoid a bunch of, in almost every case, completely unnecessary re-parsing for every `charCode`.	2022-10-19 09:55:09 +02:00
Jonas Jenwald	3e391aaed9	Remove the `Glyph.matchesForCache` method (PR 13494 follow-up) This method, and its class, was originally added in PR 4453 to reduce memory usage when parsing text. Then PR 13494 extended the `Glyph`-representation slightly to also include the `charCode`, which made the `matchesForCache` method effectively redundant since most properties on a `Glyph`-instance indirectly depend on that one. The only exception is potentially `isSpace` in multi-byte strings. Also, something that I noticed when testing this code: The `matchesForCache` method never worked correctly for `Glyph`s containing `accent`-data, since Objects are passed by reference in JavaScript. For affected fonts, of which there's only a handful of examples in our test-suite, we'd fail to find an already existing `Glyph` because of this.	2022-10-19 09:54:35 +02:00
Jonas Jenwald	de99f99a01	Fallback and try a previous generation if all else fails in `XRef.indexObjects` (issue 15577) When we fail to find a usable PDF document `trailer` and there were errors during parsing, try and fallback to a previous generation as a last resort during fetching of uncompressed references. Please note: This will not affect "normal" PDF documents, with valid /XRef data, and even most corrupt documents should be completely unaffected by these changes.	2022-10-18 20:24:01 +02:00
Tim van der Meij	06599f487f	Merge pull request #15576 from Snuffleupagus/version Re-factor the PDF version parsing in the worker-thread	2022-10-15 13:03:43 +02:00
Tim van der Meij	2508792f29	Merge pull request #15572 from Snuffleupagus/simpleFontToUnicode-refactor Slightly re-factor `PartialEvaluator._simpleFontToUnicode`	2022-10-15 12:31:27 +02:00
Jonas Jenwald	d470010293	Re-factor the PDF version parsing in the worker-thread Part of this is very old code, and back when support for parsing the catalog-version was added things became less clear (in my opinion). Hence this patch tries to improve things, by e.g. validating the header- and catalog-version separately.	2022-10-15 12:06:39 +02:00
Jonas Jenwald	15d4d80d45	Merge pull request #15563 from Snuffleupagus/issue-15559 Take the /CIDToGIDMap into account when getting the glyph mapping for CFF fonts (issue 15559)	2022-10-14 09:13:41 +02:00
Jonas Jenwald	fa47d4b9b1	Slightly re-factor `PartialEvaluator._simpleFontToUnicode` Given the sheer number of heuristics added to this method over the years, moving the valid unicode found case to the top should improve readability of the code.	2022-10-13 21:42:57 +02:00
Jonas Jenwald	f2f0a1e871	[api-minor] Stop sending "UnsupportedFeature" from the worker-thread GetOperatorList-handling This code was added all the way back in PR 6698, almost seven years ago, for backwards compatibility reasons. At this point in time, it seems that we can remove that since: - We have more fine-grained "UnsupportedFeature" reporting elsewhere in the worker-thread code nowadays. - The GetOperatorList-handling is now using `ReadableStream`s, which means that errors are being forwarded to the main-thread anyway. - We're also no longer displaying a notification-bar, in the built-in Firefox PDF Viewer, for any of these "UnsupportedFeature" messages.	2022-10-13 11:46:17 +02:00
Jonas Jenwald	858d941ff8	Take the /CIDToGIDMap into account when getting the glyph mapping for CFF fonts (issue 15559) Please note: I don't really know what I'm doing here, however the patch appears to fix the referenced issue when comparing the rendering with Adobe Reader (with the caveat that I don't speak the language in question).	2022-10-13 10:02:25 +02:00
Jonas Jenwald	5bc6f964db	Slightly re-factor the version fetching in `PDFDocument.checkHeader` Note how after having found the "%PDF-" prefix we then read both the prefix and the version in the loop, only to then remove the prefix at the end. It seems better to instead advance the stream position past the "%PDF-" prefix, and then read only the version data. Finally the loop-condition can also be simplified slightly, to further clean-up some very old code.	2022-10-11 13:15:01 +02:00
Jonas Jenwald	081e897588	Ensure that `Page.getOperatorList` handles Annotation parsing errors correctly (issue 15557) Fixes a regression from PR 15246, sorry about that! The return value of all `Annotation.getOperatorList` methods was changed in PR 15246, however I missed updating the error code-path in `Page.getOperatorList` which thus breaks all operatorList-parsing for pages with corrupt Annotations.	2022-10-10 09:48:01 +02:00
Tim van der Meij	dff444d441	Merge pull request #15555 from Snuffleupagus/improve-GetDocRequest Clean-up the data that we're sending with "GetDocRequest"	2022-10-09 14:10:44 +02:00
Jonas Jenwald	8a4f6aca97	Stop using the `source`-object when sending "GetDocRequest" Looking at the code on the worker-thread, there doesn't appear to be any particular reason for placing some of the properties in a `source`-object when sending them with "GetDocRequest". As is often the case the explanation for this structure is rather "for historical reasons", since originally we simply sent the `source`-object as-is. Doing that was obviously a bad idea, for a couple of reasons: - It makes it less clear what is/isn't actually needed on the worker-thread. - Sending unused properties will unnecessarily increase memory usage. - The `source`-object may contain unclonable data, which would break the library.	2022-10-09 12:45:24 +02:00
Jonas Jenwald	c84b717773	Group the `evaluatorOptions` on the main-thread, when sending "GetDocRequest" Rather than sending all of these parameters individually and then grouping them together on the worker-thread, we can simply handle that in the API instead.	2022-10-09 12:31:03 +02:00
Jonas Jenwald	4cc98de6d7	Remove the unused `CMapCompressionType.STREAM` value This was added in PR 8064, over five years ago, for a possible future CMap file-format that was never implemented.	2022-10-08 17:10:05 +02:00
Calixte Denizet	c0e165bf97	Simplify the way to compute the remainder modulo 3 in PDF20Hash function I noticed that 256 % 3 is equal to 1, so I slightly simplified the code. The sum of the 16 Uint8 doesn't exceed 2^12, hence we can just take the sum modulo 3.	2022-10-07 14:43:31 +02:00
Jonas Jenwald	3cb119cb32	Merge pull request #15539 from Snuffleupagus/DecryptStream-set Replace loop with `TypedArray.prototype.set` in the `DecryptStream.readBlock` method	2022-10-07 11:14:28 +02:00
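
Note on c0e165bf97: the commit message states the arithmetic but the listing does not show the code. The sketch below is an illustration only and not the pdf.js implementation; the function names `remainderMod3BigInt` and `remainderMod3Sum` are hypothetical. It assumes the 16 bytes are read as one big-endian integer, and checks that, because 256 % 3 is 1, that integer and the plain sum of its bytes leave the same remainder modulo 3.

```javascript
// Reference version (assumption: the 16 bytes form one big-endian integer).
// Builds the full 128-bit value with BigInt and reduces it modulo 3.
function remainderMod3BigInt(bytes) {
  let value = 0n;
  for (const b of bytes) {
    value = (value << 8n) | BigInt(b);
  }
  return Number(value % 3n);
}

// Simplified version: every power of 256 is congruent to 1 modulo 3,
// so each byte contributes just its own value. The sum of 16 bytes is at
// most 16 * 255 = 4080, which is below 2^12 and safe as a plain number.
function remainderMod3Sum(bytes) {
  let sum = 0;
  for (const b of bytes) {
    sum += b;
  }
  return sum % 3;
}
```

Both functions can be compared on any 16-byte `Uint8Array`; the second avoids building the large intermediate value.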

2677 Commits