Sakurai/pdf.js - pdf.js - Gitea on kemo

Sakurai/pdf.js

Author	SHA1	Message	Date
Jonas Jenwald	a01c599247	Cache the "raw" standard font data in the worker-thread (PR 12726 follow-up) This implementation is basically a copy of the pre-existing `builtInCMapCache` implementation. For some, badly generated, PDF documents it's possible that we'll end up having to fetch the same standard font data over and over (which is obviously inefficient). While not common, it's certainly possible that a PDF document uses custom font names where the actual font then references one of the standard fonts; see e.g. issue 11399 for one such example. Note that I did suggest adding worker-thread caching of standard font data in PR 12726, however it wasn't deemed necessary at the time. Now that we have a real-world example that benefit from caching, I think that we should simply implement this now.	2021-06-09 18:27:51 +02:00
Calixte Denizet	34a2fa72c7	XFA - Add Liberation-Sans font as a substitution for some missing fonts - Some js files contain scale factors for each glyph in order to rescale Liberation to have a final font with the correct width. - A lot of XFA have some containers where their dimensions are based on their text content, so using default font from browser can lead to an almost unreadable pdf.	2021-06-09 16:55:45 +02:00
calixteman	8c53bf8647	Merge pull request #13437 from calixteman/xfa_mv_root XFA - Move the fake HTML representation of XFA from the worker to the main thread	2021-05-31 10:14:15 +02:00
Calixte Denizet	1b0006093d	Italic angle is defined clockwise in CSS when it's counterclockwise in PDF	2021-05-28 11:06:11 +02:00
Calixte Denizet	45c3f00a27	XFA - Move the fake HTML representation of XFA from the worker to the main thread - the only goal of this patch is to be able to get synchronously the fake html when printing from firefox: - in order to print we need to inject some html in beforeprint callback but we cannot block in waiting for all the pages. - from a memory point of view: it doesn't change anything since the fake HTML is deleted in the worker; - this way we don't break any assumptions.	2021-05-25 19:33:07 +02:00
Jonas Jenwald	1a8d05fdcf	Remove some, with Prettier `2.3.0`, unnecessary `// prettier-ignore` comments To get the maximum benefit from something like Prettier, you obviously don't want to disable the automatic formatting unless absolutely necessary. When we added Prettier there were a number of cases, mostly involving larger Arrays, which required disabling of the automatic formatting for overall readability and/or to not break inline comments. With changes in Prettier version `2.3.0`, see [the release notes](https://prettier.io/blog/2021/05/09/2.3.0.html#concise-formatting-of-number-only-arrays-10106httpsgithubcomprettierprettierpull10106-10160httpsgithubcomprettierprettierpull10160-by-thorn0httpsgithubcomthorn0), there's now better formatting support for Arrays containing only numbers. Hence we can now remove a number of `// prettier-ignore` comments, and thus get the benefit of automatic formatting in (slightly) more of the code-base.	2021-05-19 11:36:03 +02:00
Brendan Dahl	17e9cfcd2a	Merge pull request #13328 from calixteman/js_display1 JS - Add support for display property	2021-05-17 08:47:13 -07:00
Jonas Jenwald	4248f0745c	Improve the `Page.content` and `Page.getContentStream` methods First of all, by using `Dict.getArray` in the `Page.content` getter we remove the need to manually iterate through and fetch the sub-streams (when they exist) in the `Page.getContentStream` method. Secondly, we can simplify the code in `Page.{getOperatorList, extractTextContent}` by letting `Page.getContentStream` ensure that `content` is available and returning a Promise instead.	2021-05-14 11:47:34 +02:00
Calixte Denizet	af125cd299	JS - Add support for display property - in annotation_layer, move common properties treatment in a common method instead having duplicated code in each widget.	2021-05-06 11:15:38 +02:00
Jonas Jenwald	3624f9eac7	Add a new `BaseStream.getString(...)` method to replace manual `bytesToString(BaseStream.getBytes(...))` calls Given that the `bytesToString(BaseStream.getBytes(...))` pattern is somewhat common throughout the `src/core/` code, it cannot hurt to add a new `BaseStream`-method which handles that case internally.	2021-05-01 19:20:36 +02:00
Jonas Jenwald	30a22a168d	Move the `DecodeStream` and `StreamsSequenceStream` from `src/core/stream.js` and into its own file	2021-04-28 10:16:51 +02:00
Jonas Jenwald	8f6543c218	Ensure that the /Properties, used with optional content, is actually loaded before parsing the operatorList/textContent (PR 12095 follow-up) By not waiting for the /Properties to load, before parsing of the operatorList/textContent starts, there's a very real risk that a `MissingDataException` will be thrown when trying to access the data in the `PartialEvaluator.parseMarkedContentProps` method. If this ever happens it will thus lead to incomplete and/or outright broken rendering, and with e.g. `disableAutoFetch=true` set the likelihood of this occuring would increase quite a bit. Please note: While I've not yet seen this error in an actual PDF document, it can happen during loading if you're unlucky enough with e.g. the structure of the PDF document and/or the download speed offered by the server.	2021-04-20 20:22:44 +02:00
Jonas Jenwald	f560fe6875	A couple of small scripting/XFA-related tweaks in the worker-code - Use `PDFManager.ensureDoc`, rather than `PDFManager.ensure`, in a couple of spots in the code. If there exists a short-hand format, we should obviously use it whenever possible. - Fix a unit-test helper, to account for the previous changes. (Also, converts a function to be `async` instead.) - Add one more exists-check in `PDFDocument.loadXfaFonts`, which I missed to suggest in PR 13146, to prevent any possible errors if the method is ever called in a situation where it shouldn't be. Also, print a warning if the actual font-loading fails since that could help future debugging. (Finally, reduce overall indentation in the loop.) - Slightly unrelated, but make a small tweak of a comment in `src/core/fonts.js` to reduce possible confusion.	2021-04-17 10:34:22 +02:00
Brendan Dahl	ac3fa1e3d7	Merge pull request #13146 from calixteman/xfa_fonts XFA -- Load fonts permanently from the pdf	2021-04-16 12:55:12 -07:00
Calixte Denizet	7e9579045f	XFA -- Load fonts permanently from the pdf - Different fonts can be used in xfa and some of them are embedded in the pdf. - Load all the fonts in window.document. Update src/core/document.js Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com> Update src/core/worker.js Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>	2021-04-15 17:57:42 +02:00
Jonas Jenwald	1d6d476cab	Rename the `src/core/obj.js` file to `src/core/catalog.js` Now that only the `Catalog` remains in this file, after the previous patches, it makes sense to rename the file to reduce confusion.	2021-04-13 21:00:30 +02:00
Jonas Jenwald	e8750cfe95	Move the `XRef` from `src/core/obj.js` and into its own file The size of the `src/core/obj.js` file has increased slowly over the years, and it also contains a fair amount of distinct functionality. In order to improve readability and make it easier to navigate through the code, this patch moves the `XRef` into its own file.	2021-04-13 21:00:30 +02:00
Jonas Jenwald	604cd6d600	Move the `ObjectLoader` from `src/core/obj.js` and into its own file The size of the `src/core/obj.js` file has increased slowly over the years, and it also contains a fair amount of distinct functionality. In order to improve readability and make it easier to navigate through the code, this patch moves the `ObjectLoader` into its own file.	2021-04-13 21:00:30 +02:00
Tim van der Meij	ebeb3f7999	Merge pull request #13234 from Snuffleupagus/hasJSActions-MissingDataException [api-minor] Ensure that `PDFDocumentProxy.hasJSActions` won't fail if `MissingDataException`s are thrown during the associated worker-thread parsing	2021-04-13 20:44:58 +02:00
Jonas Jenwald	2b2234fd5a	[api-minor] Ensure that `PDFDocumentProxy.hasJSActions` won't fail if `MissingDataException`s are thrown during the associated worker-thread parsing With the current implementation of `PDFDocument.hasJSActions`, in the worker-thread, we're not actually handling not-yet-loaded data correctly. This can thus fail in two different ways: - The `PDFDocument.fieldObjects` getter (and its helper method), while it may return a Promise, still fetches all of its data synchronously and it can thus throw a `MissingDataException` during parsing. - The `Catalog.jsActions` getter, which is completely synchronous, can obviously throw a `MissingDataException` during parsing. If either of these cases occur currently, the `PDFDocumentProxy.hasJSActions` method in the API can either return a rejected Promise (which it never should) or possibly "hang" and never resolve. Please note: While I've not yet seen this error in an actual PDF document, it can happen during loading if you're unlucky enough with e.g. the structure of the PDF document and/or the download speed offered by the server. This patch is thus based on code-inspection and on manually throwing a `MissingDataException` on the first access of `Catalog.jsActions` to simulate this situation. Finally, this patch adds a couple of API unit-tests for this (since none existed).	2021-04-13 14:33:56 +02:00
Jonas Jenwald	4aa27cc645	Re-factor `Catalog._collectJavaScript` to use a `Map` rather than an Object Given that this only an internal helper method, used by the `Catalog.{javaScript, jsActions}` getters, this change simplifies iteration of the returned data. We can also (slightly) re-factor the code of the `jsActions` getter, and remove an obsolete[1] JSDoc-comment from the `openAction` getter. --- [1] Not really relevant now that we've got proper scripting support.	2021-04-13 14:16:17 +02:00
Jonas Jenwald	9360c7cbdc	Avoid unnecessary parsing, in `Page.GetStructTree`, when no structTree is available (PR 13221 follow-up) It's obviously (a bit) more efficient to return early in `Page.getStructTree`, rather than trying to first "parse" an empty structTree-root. Somehow I didn't think of this yesterday, but this feels like a much better solution overall; sorry about the churn here!	2021-04-12 08:54:21 +02:00
Jonas Jenwald	ff4dae05b0	Ensure that `getStructTree` won't break with `disableAutoFetch = true` set (PR 13171 follow-up) Open http://localhost:8888/web/viewer.html?file=/test/pdfs/pdf.pdf#disableStream=true&disableAutoFetch=true and observe the following message in the console (repeated for each page of the document): ``` Uncaught (in promise) Object { message: "Missing data [19787293, 19787294)", name: "UnknownErrorException", details: "MissingDataException: Missing data [19787293, 19787294)", stack: "BaseExceptionClosure@http://localhost:8888/src/shared/util.js:458:29\n@http://localhost:8888/src/shared/util.js:462:3\n" } ```	2021-04-11 12:15:33 +02:00
Tim van der Meij	d9d626a5e1	Merge pull request #13214 from calixteman/signatures Display widget signature	2021-04-10 19:35:16 +02:00
Calixte Denizet	5875ebb1ca	Display widget signature - but don't validate them for now; - Firefox will display a bar to warn that the signature validation is not supported (see https://bugzilla.mozilla.org/show_bug.cgi?id=854315) - almost all (all ?) pdf readers display signatures; - validation is done in edge but for now it's behind a pref.	2021-04-10 19:13:28 +02:00
Brendan Dahl	fc9501a637	Add support for basic structure tree for accessibility. When a PDF is "marked" we now generate a separate DOM that represents the structure tree from the PDF. This DOM is inserted into the <canvas> element and allows screen readers to walk the tree and have more information about headings, images, links, etc. To link the structure tree DOM (which is empty) to the text layer aria-owns is used. This required modifying the text layer creation so that marked items are now tracked.	2021-04-09 09:56:28 -07:00
calixteman	84d7cccb1d	JS - Handle correctly hierarchy of fields (#13133 ) * JS - Handle correctly hierarchy of fields - it aims to fix #13132; - annotations can inherit their actions from the parent field; - there are some fields which act as a container for other fields: - they can be access through js so need to add them with an empty type (nothing in the spec about that but checked in Acrobat); - calculation order list (CO) can reference them so need make them through this.getField; - getArray method must return kids. - field values are number, string, ... depending of their type but nothing in the spec on how to know what's the type: - according to the comment for Canonical Format: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=461 - it seems that this "type" can be guessed from js action Format (when setting a type in Acrobat DC, the only affected thing is this action). - util.scand with an empty string returns the current date.	2021-03-30 08:50:35 -07:00
calixteman	24e598a895	XFA - Add a layer to display XFA forms (#13069 ) - add an option to enable XFA rendering if any; - for now, let the canvas layer: it could be useful to implement XFAF forms (embedded pdf in xml stream for the background and xfa form for the foreground); - ui elements in template DOM are pretty close to their html counterpart so we generate a fake html DOM from template one: - it makes easier to translate template properties to html ones; - it makes faster the creation of the html element in the main thread.	2021-03-19 10:11:40 +01:00
Calixte Denizet	ffd4bc790c	JS -- Add tests for print/save actions * change PDFDocument::hasJSActions to return true when there are JS actions in catalog.	2020-12-24 18:51:00 +01:00
Calixte Denizet	1e2173f038	JS - Collect and execute actions at doc and pages level * the goal is to execute actions like Open or OpenAction * can be tested with issue6106.pdf (auto-print) * once #12701 is merged, we can add page actions	2020-12-18 20:03:59 +01:00
Calixte Denizet	03814bd6a2	Don't use 'in' operator to check if key is in a Map	2020-12-16 16:00:12 +01:00
Tim van der Meij	00b4f86db3	Merge pull request #12717 from Snuffleupagus/issue-12714 Ensure that the /Annots-entry, on /Page-instances, is actually an Array (issue 12714)	2020-12-10 23:06:59 +01:00
Calixte Denizet	25bf504ff5	Be sure that CalculationOrder is either null or a non-empty array	2020-12-10 16:02:11 +01:00
Jonas Jenwald	796a0d3155	Ensure that the /Annots-entry, on /Page-instances, is actually an Array (issue 12714) In the referenced PDF document, the second and third page has corrupt /Annots-entries which contain /Dict-data rather than the intended Arrays.	2020-12-10 11:42:00 +01:00
Calixte Denizet	b11592a756	JS -- hidden annotations must be built in case a script show them * in some pdf, there are actions with "event.source.hidden = ..." * in order to handle visibility when printing, annotationStorage is extended to store multiple properties (value, hidden, editable, ...)	2020-11-10 12:48:34 +01:00
Calixte Denizet	a5279897a7	JS -- Add listener for sandbox events only if there are some actions * When no actions then set it to null instead of empty object * Even if a field has no actions, it needs to listen to events from the sandbox in order to be updated if an action changes something in it.	2020-11-09 18:37:59 +01:00
Jonas Jenwald	082cd8fc6c	Add global caching, for /Resources without blend modes, and use it to reduce repeated fetching/parsing in `PartialEvaluator.hasBlendModes` The `PartialEvaluator.hasBlendModes` method is necessary to determine if there's any blend modes on a page, which unfortunately requires synchronous parsing of the /Resources of each page before its rendering can start (see the "StartRenderPage"-message). In practice it's not uncommon for certain /Resources-entries to be found on more than one page (referenced via the XRef-table), which thus leads to unnecessary re-fetching/re-parsing of data in `PartialEvaluator.hasBlendModes`. To improve performance, especially in pathological cases, we can cache /Resources-entries when it's absolutely clear that they do not contain any blend modes at all[1]. This way, subsequent `PartialEvaluator.hasBlendModes` calls can be made significantly more efficient. This patch was tested using the PDF file from issue 6961, i.e. https://github.com/mozilla/pdf.js/files/121712/test.pdf: ``` [ { "id": "issue6961", "file": "../web/pdfs/issue6961.pdf", "md5": "a80e4357a8fda758d96c2c76f2980b03", "rounds": 100, "type": "eq" } ] ``` which gave the following results when comparing this patch against the `master` branch: ``` -- Grouped By browser, page, stat -- browser \| page \| stat \| Count \| Baseline(ms) \| Current(ms) \| +/- \| % \| Result(P<.05) ------- \| ---- \| ------------ \| ----- \| ------------ \| ----------- \| ---- \| ------ \| ------------- firefox \| 0 \| Overall \| 100 \| 1034 \| 555 \| -480 \| -46.39 \| faster firefox \| 0 \| Page Request \| 100 \| 489 \| 7 \| -482 \| -98.67 \| faster firefox \| 0 \| Rendering \| 100 \| 545 \| 548 \| 2 \| 0.45 \| firefox \| 1 \| Overall \| 100 \| 912 \| 428 \| -484 \| -53.06 \| faster firefox \| 1 \| Page Request \| 100 \| 487 \| 1 \| -486 \| -99.77 \| faster firefox \| 1 \| Rendering \| 100 \| 425 \| 427 \| 2 \| 0.51 \| ``` --- [1] In the case where blend modes are found, it becomes a lot more difficult to know if it's generally safe to skip /Resources-entries. Hence we don't cache anything in that case, however note that most document/pages do not utilize blend modes anyway.	2020-11-05 16:59:08 +01:00
Jonas Jenwald	b44a975d7c	Prevent issues, in `PDFDocument.fieldObjects`, for invalid Annotations For an invalid Annotation, there's one code-path where `undefined` is returned from `AnnotationFactory._create`. That'd currently, incorrectly, trigger an error during the `PDFDocument._collectFieldObjects` parsing which thus seem good to avoid. Along these lines, the filtering in `PDFDocument.fieldObjects` is also updated to handle both `null` and `undefined` the same way.	2020-10-22 13:24:43 +02:00
Calixte Denizet	c30a3a94f0	JS - Add a function in api to get the fields ids in AcroForm::CO	2020-10-17 12:56:40 +02:00
Jonas Jenwald	29af15f37e	Add more validation in the `PDFDocument._hasOnlyDocumentSignatures` method If this method is ever passed invalid/unexpected data, or if during the course of parsing (since it's used recursively) such data is found, it will fail in a non-graceful way. Hence this patch, which ensures that we don't attempt to access non-existent properties and also that errors such as the one fixed in PR 12479 wouldn't have occured.	2020-10-16 13:03:47 +02:00
Jonas Jenwald	3351d3476d	Don't store complex data in `PDFDocument.formInfo`, and replace the `fields` object with a `hasFields` boolean instead This patch is based on a couple of smaller things that I noticed when working on PR 12479. - Don't store the /Fields on the `formInfo` getter, since that feels like overloading it with unintended (and too complex) data, and utilize a `hasFields` boolean instead. This functionality was originally added in PR 12271, to help determine what kind of form data a PDF document contains, and I think that we should ensure that the return value of `formInfo` only consists of "simple" data. With these changes the `fieldObjects` getter instead has to look-up the /Fields manually, however that shouldn't be a problem since the access is guarded by a `formInfo.hasFields` check which ensures that the data both exists and is valid. Furthermore, most documents doesn't even have any /AcroForm data anyway. - Determine the `hasFields` property first, to ensure that it's always correct even if there's errors when checking e.g. the /XFA or /SigFlags entires, since the `fieldObjects` getter depends on it. - Simplify a loop in `fieldObjects`, since the object being accessed is a `Map` and those have built-in iteration support. - Use a higher logging level for errors in the `formInfo` getter, and include the actual error message, since that'd have helped with fixing PR 12479 a lot quicker. - Update the JSDoc comment in `src/display/api.js` to list the return values correctly, and also slightly extend/improve the description.	2020-10-16 12:47:27 +02:00
Calixte Denizet	71ecc3129b	Add the possibility to collect Javascript actions	2020-10-14 10:44:16 +02:00
Jonas Jenwald	9416b14e8b	Re-factor how the ESLint `no-var` rule is enabled in the `src/` folder This simplifies/consolidates the ESLint configuration slightly in the `src/` folder, and prevents the addition of any new files where `var` is being used.[1] Hence we no longer need to manually add `/* eslint no-var: error */` in files, which is easy to forget, and can instead disable the rule in the `src/core/` files where `var` is still in use. --- [1] Obviously the `no-var` rule can, in the same way as every other rule, be disabled on a case-by-case basis where actually necessary.	2020-10-03 20:15:29 +02:00
Jonas Jenwald	784a420027	Add support, in `Dict.merge`, for merging of "sub"-dictionaries This allows for merging of dictionaries one level deeper than previously. This could be useful e.g. for /Resources dictionaries, where you want to e.g. merge their respective /Font dictionaries (and other) together rather than picking just the first one.	2020-08-30 23:18:32 +02:00
Tim van der Meij	0f229d537f	Inline the `setup` method in the `parse` method in `src/core/document.js` Now that the `parse` method is simplified we can inline the `setup` method in the `parse` method since it's only two lines of code. This avoids some indirection.	2020-08-25 23:28:55 +02:00
Tim van der Meij	280207c740	Redo the form type detection logic and include unit tests Good form type detection is important to get reliable telemetry and to only show the fallback bar if a form cannot be filled out by the user. PDF.js only supports AcroForm data, so XFA data is explicitly unsupported (tracked in issue #2373). However, the previous form type detection couldn't separate AcroForm and XFA well enough, causing form type telemetry to be incorrect sometimes and the fallback bar to be shown for forms that could in fact be filled out by the user. The solution in this commit is found by studying the specification and the form documents that are available to us. In a nutshell the rules are: - There is XFA data if the `XFA` entry is a non-empty array or stream. - There is AcroForm data if the `Fields` entry is a non-empty array and it doesn't consist of only document signatures. The document signatures part was not handled in the old code, causing a document with only XFA data to also be marked as having AcroForm data. Moreover, the old code didn't check all the data types. Now that AcroForm and XFA can be distinguished, the viewer is configured to only show the fallback bar for documents that only have XFA data. If a document also has AcroForm data, the viewer can use that to render the form. We have not found documents where the XFA data was necessary in that case. Finally, we include unit tests to ensure that all cases are covered and move the form type detection out of the `parse` function so that it's only executed if the document information is actually requested (potentially making initial parsing a tiny bit faster).	2020-08-25 23:28:55 +02:00
Tim van der Meij	f20f0bcc78	Move the AcroForm logic from the document to the catalog The `AcroForm` entry is part of the catalog, not of the document, so its logic should be placed there instead. The document should look in the catalog to fetch it, and not have knowledge of `catDict`, which is a member internal to the catalog. Moreover, make the AcroForm member private on the document instance. It's only used internally and was also never intended to be public. For users it's exposed by the `getMetadata` API endpoint as `IsAcroFormPresent`. Only a boolean is exposed, so we now also only store the boolean on the document instance. Finally, the annotation code needs access to the full AcroForm dictionary, so it's updated to fetch the data from the catalog instead of the document that now only holds the boolean.	2020-08-25 23:28:55 +02:00
Tim van der Meij	b41a2f4d5a	Move the collection logic from the document to the catalog The `Collection` entry is part of the catalog, not of the document, so its logic should be placed there instead. The document should look in the catalog to fetch it, and not have knowledge of `catDict`, which is a member internal to the catalog. Moreover, remove the collection member from the document instance. It's only used internally and was also never intended to be public. For users it's exposed by the `getMetadata` API endpoint as `IsCollectionPresent`. Moving this out of the `parse` function makes sure that the getter is only executed if the document information is actually requested (potentially making initial parsing a tiny bit faster).	2020-08-25 23:28:55 +02:00
Tim van der Meij	935d95b462	Move the version logic from the document to the catalog The `Version` entry is part of the catalog, not of the document, so its logic should be placed there instead. The document should look in the catalog to fetch it, and not have knowledge of `catDict`, which is a member internal to the catalog. Moreover, make the version member private on the document instance. It's only used internally and was also never intended to be public. For users it's exposed by the `getMetadata` API endpoint as `PDFFormatVersion`. Finally, clarify how the version from the header and the version from the catalog are treated using a comment.	2020-08-25 23:28:55 +02:00
Calixte Denizet	1a6816ba98	Add support for saving forms	2020-08-12 10:32:59 +02:00

1 2 3