*This patch basically extends the approach from PR 10392, by also checking the last page.*
Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an *integer* /Count entry it must also be correct/valid.
As can be seen in the referenced PDF documents, that entry may be completely bogus, which causes general parsing to break down elsewhere in the worker-thread (and hangs the browser).
Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the *entire* /Pages-tree and essentially count the pages.
To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the *entire* /Pages-tree to determine the number of pages.
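As a rough sketch of the short-cut, in JavaScript (all names here are illustrative; the actual implementation lives in the worker-thread `Catalog`/`PDFDocument` code):
```
async function getValidNumPages(catalog) {
  const count = catalog.toplevelPagesDict.get("Count"); // the /Count entry
  if (Number.isInteger(count) && count > 0) {
    try {
      // If the last page can be fetched, assume that /Count is correct.
      await catalog.getPageDict(/* pageIndex = */ count - 1);
      return count;
    } catch {
      // Fall through to the slow path below.
    }
  }
  // Otherwise iterate through (potentially) the *entire* /Pages-tree.
  const pageDicts = await catalog.getAllPageDicts(); // hypothetical slow path
  return pageDicts.length;
}
```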
Unfortunately these changes will have a number of *somewhat* negative side-effects (please see the possibly incomplete list below), however I cannot see a better way to address this bug.
- This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the *last* page of the PDF documents.
- For poorly generated PDF documents, where the entire /Pages-tree only has *one* level, we'll unfortunately need to fetch/parse the *entire* /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of *some* long PDF documents.
- This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.
As one *small* additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value).
Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.
Given that not all pages are necessarily accessed, or that the pages may be accessed out of order, using a `Map` seems like a more appropriate data-structure here.
Furthermore, this patch also adds (currently missing) caching for XFA-documents. Loading a couple of such documents in the viewer, with logging added, shows that we're currently re-creating `Page`-instances unnecessarily for XFA-documents.
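A rough sketch of the caching described above (simplified; the names are illustrative rather than the actual ones):
```
class DocumentPageCache {
  constructor(loadPage) {
    this._loadPage = loadPage; // the actual page-creation function
    this._pagePromises = new Map(); // pageIndex -> Promise<Page>
  }

  getPage(pageIndex) {
    const cachedPromise = this._pagePromises.get(pageIndex);
    if (cachedPromise) {
      // Covers out-of-order access, and now XFA-documents as well, without
      // re-creating `Page`-instances.
      return cachedPromise;
    }
    const promise = this._loadPage(pageIndex);
    this._pagePromises.set(pageIndex, promise);
    return promise;
  }
}
```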
For this particular PDF document, we have `/W [1 2 166666666666666666666666666]` which obviously makes no sense.
While this patch makes no attempt at actually validating the entries in the /W-array, we'll now simply abort all processing when the end of the PDF document has been reached (thus preventing hanging the browser).
Please note that this patch doesn't enable the PDF document to be loaded/rendered, but at least it fails "correctly" now.
Fixes one of the issues listed in issue 14303, namely the `REDHAT-1531897-0.pdf` document.
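To illustrate the guard (a sketch only, not the exact code; `readEntry` is a hypothetical per-entry reader using the /W field widths):
```
function readXRefStreamEntries(stream, entryCount, readEntry) {
  for (let i = 0; i < entryCount; i++) {
    // A bogus /W array may claim (vastly) more data than the document
    // actually contains; abort instead of hanging the browser.
    if (stream.pos >= stream.end) {
      throw new Error("Invalid XRef stream: end of document reached.");
    }
    readEntry(stream);
  }
}
```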
This bug was surprisingly difficult to track down, since it didn't just depend on range-requests being used but also on how quickly the document was loaded. To even be able to reproduce this locally, I had to use a very small `rangeChunkSize`-value (note the unit-test).
The cause of this bug is a bogus entry in the XRef-table, causing us to attempt to request data from *beyond* the actual document size and thus getting into an infinite loop.
Fixes *one* of the issues listed in issue 14303, namely the `PDFBOX-4352-0.pdf` document.
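The kind of guard involved looks roughly like this (illustrative only; the actual fix differs in its details):
```
function clampRequestedRange(begin, end, docLength) {
  if (begin >= docLength) {
    // A bogus XRef entry points beyond the document; there's no data to
    // fetch, so don't issue a request that can never be satisfied.
    return null;
  }
  return { begin, end: Math.min(end, docLength) };
}
```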
*Please note:* These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents.
The existing `PDFDocumentProxy.getStats` function requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for *every rendered* page.
This patch proposes replacing that method with a *synchronous* `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class to track Stream/Font-types and *only send* them to the main-thread *the first time* that a type is encountered.
Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1]
This re-factoring will obviously benefit longer documents the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, note that if the user zooms/rotates the document (which causes re-rendering), even a one-page document will start to benefit from these changes.
Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return *the same identical* object.
This is something that we can easily take advantage of in the default viewer, by now *only* reporting "documentStats" telemetry[3] when the data has actually changed, rather than once per rendered page (again beneficial in longer documents).
---
[1] Furthermore, the maximum number of `StreamType`/`FontType` values is `10` and `12` respectively, which means that regardless of the complexity and page count of a PDF document, there'll never be more than twenty-two "DocStats" messages sent; see 41ac3f0c07/src/shared/util.js (L206-L232)
[2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only results in a total of seven "DocStats" messages being sent from the worker-thread.
[3] Reporting telemetry, in Firefox, includes using `JSON.stringify` on the data and then sending an event to the `PdfStreamConverter.jsm`-code.
In that code the event is handled and `JSON.parse` is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549
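A rough sketch of the worker-thread side of the `DocStats`-class described above (names and message format are illustrative):
```
class DocStats {
  constructor(messageHandler) {
    this._handler = messageHandler;
    this._streamTypes = new Set();
    this._fontTypes = new Set();
  }

  addStreamType(type) {
    if (this._streamTypes.has(type)) {
      return; // Already reported; don't send another message.
    }
    this._streamTypes.add(type);
    this._send();
  }

  addFontType(type) {
    if (this._fontTypes.has(type)) {
      return;
    }
    this._fontTypes.add(type);
    this._send();
  }

  _send() {
    // Only reached the *first time* that a new type is encountered.
    this._handler.send("DocStats", {
      streamTypes: [...this._streamTypes],
      fontTypes: [...this._fontTypes],
    });
  }
}
```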
Given that all modern browsers now support `postMessage` transfers, and have for years, it no longer seems necessary for the PDF.js library to support using Workers unless the `postMessage` transfers functionality is available.
This patch is a follow-up to PR 11123, which made it impossible to *manually* disable `postMessage` transfers for performance reasons (since it increases memory usage), which hasn't caused any bug reports as far as I know.[1]
Hence we'll now only support *proper* Worker implementations, with fully working `postMessage` transfers, and fallback to using "fake" Workers otherwise.
---
[1] At the time of that PR we still "supported" IE, which is why this code was left intact.
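For context, the kind of feature-test involved looks roughly like this (a sketch, not the exact PDF.js code): if `postMessage` transfers work, the transferred buffer is detached on the sending side, and otherwise we fallback to "fake" Workers.
```
function supportsPostMessageTransfers(worker) {
  try {
    const buffer = new ArrayBuffer(1);
    worker.postMessage({ test: buffer }, [buffer]);
    // With working transfers the buffer is detached, i.e. has zero length.
    return buffer.byteLength === 0;
  } catch {
    return false;
  }
}
```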
There's obviously no guarantee that this will work in general, if the document is sufficiently corrupt, but it should hopefully be better than just throwing `InvalidPDFException` as currently happens.
Please note that, as is often the case with corrupt documents, it's somewhat difficult to know if we're rendering the document "correctly" with this patch[1]. In this case even Adobe Reader cannot open the document, which is always a good sign that it's *really* corrupt, however we're at least able to render *something* with this patch.
---
[1] Whatever "correct" even means when dealing with corrupt PDF documents, where often times different PDF viewers won't agree completely.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=931481;
- real space chars are pushed in the chunk, but when there is extra spacing the next char position must be compared with the previous one;
- for example, extra spacing can cancel out a space so that visually there is no space.
- First step to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1737260;
- several interactive pdfs use the possibility to hide/show buttons to show different icons;
- render pushbuttons on their own canvas and then insert it into the annotation_layer;
- update test/driver.js in order to convert canvases for pushbuttons into images.
In `beginGroup` we create a new canvas that is the size of the
bounding box and we translate it to the offset. This means we don't need to
also apply the bounding box during `paintFormXObjectBegin`.
This improves #6961 quite a bit, but it is still missing the indentation
in the ruler.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1739502;
- when the target area was the current content area, everything was pushed into it instead of creating a new one (which in turn creates a new pageArea).
- the pdf shows an alignment issue on page 4:
- the hAlign is "center" but the subform was as wide as its parent, so compute the real width of the subform with tb layout;
- there is an extra empty page at the end of the pdf:
- there is a subform with some hidden elements which are not rendered for now (since no JS engine is plugged in, it isn't possible to draw them by changing their visibility);
- so in case a subform is empty and has no real dimensions (at least one of them is 0), we just treat it as empty.
The subform "nomin" displays even though its `<occur>` element is set to `<occur max=-1 min=0>`.
Looking at the XFA 3.3 specification (https://www.pdfa.org/norm-refs/XFA-3_3.pdf):
- The min attribute is used when processing a form that contains data. Regardless of the data at least this number of instances is included. It is permissible to set this value to zero, in which case the container is entirely excluded if there is no data for it.
However, in our case that doesn't happen, because we let our empty dataNode get through. By adding a clause:
- eliminate unmatched data with occur min=0
we now check for such empty data and add it to the uselessNodes array, from which it is removed at the end.
Very short strings can narrowly miss the existing Bidi-detection threshold, leading to incorrect text-selection and copying behaviour.
In my testing, neither Adobe Reader nor PDFium seems to handle copying "correctly" for this document. Hence it's not entirely clear to me that we actually want to fix this, since tweaking these heuristics can *obviously* cause regressions elsewhere (and our test coverage for RTL-text isn't exactly great).
Trying to render these Annotation-types, when the borderWidth is `0`, causes a "hairline" border to appear. If these Annotations included an appearance stream, as they are supposed to, this wouldn't have happened, and the simplest solution here seems to be to just ignore these particular Annotations.
- PR #13257 fixed a lot of issues, but not all of them, and this patch aims to fix almost all remaining ones.
- the idea in this new patch is to compare the position of a new glyph with the last position where a glyph was drawn;
- no spaces are "drawn": a space just moves the cursor but isn't added in the chunk;
- so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
- to differentiate between real spaces and tracking ones, we used a factor of the space width (from the font);
- it was a pretty good idea in general, but it fails with some fonts where the space width is too big;
- in Poppler, they're using a factor of the font size instead: this is an excellent idea (<= 0.1 * fontSize implies a tracking space).
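In condensed form, the new heuristic amounts to something like this (the constant name is illustrative):
```
const TRACKING_SPACE_FACTOR = 0.1;

function isTrackingSpace(advanceWidth, fontSize) {
  // Cursor advances narrower than 10% of the font size are treated as
  // tracking/kerning rather than real inter-word spaces.
  return advanceWidth <= TRACKING_SPACE_FACTOR * fontSize;
}
```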
*Please note:* This is a tentative patch, since I don't have the necessary a11y-software to actually test it.
To avoid having to add a new API-method just for a single string, I figured that adding the new property to the existing `documentInfo`-data (accessed via `PDFDocumentProxy.getMetadata` in the API) will hopefully be deemed acceptable.
In these cases there's no good reason, in my opinion, to duplicate the `shadow`-lines since that unnecessarily increases the risk of simple typos (see the previous patch).
With this typo the shadowing doesn't actually work, which causes these checks to be unnecessarily repeated. In this particular case it didn't have a significant performance impact, however we should definitely fix this nonetheless.
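For context, a minimal version of the `shadow` helper and the getter pattern in question (simplified from the actual src/shared/util.js code):
```
function shadow(obj, prop, value) {
  Object.defineProperty(obj, prop, {
    value,
    enumerable: true,
    configurable: true,
    writable: false,
  });
  return value;
}

class Example {
  get expensiveThing() {
    const value = 42; // stand-in for an expensive computation
    // The property name passed to `shadow` must match the getter's name;
    // with a typo the getter is never replaced and re-runs on every access.
    return shadow(this, "expensiveThing", value);
  }
}
```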
If a PDF included an embedded TrueType font whose preferred character
map (cmap) was in "format 2", the code would select that character map
and then refuse to read it because of an unsupported format, thus
causing the characters not to be rendered. This commit implements
support for format 2 as described at the link below.
https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
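A heavily simplified sketch of the format 2 lookup, assuming the subHeaderKeys/subHeaders/glyphIndexArray tables have already been parsed (all names are illustrative, including `idRangeStart` which stands in for the resolved idRangeOffset):
```
function lookupCmapFormat2(charCode, subHeaderKeys, subHeaders, glyphIndexArray) {
  const highByte = (charCode >> 8) & 0xff;
  const lowByte = charCode & 0xff;
  // subHeaderKeys maps the high byte to a subHeader (stored as index * 8 in
  // the font file); subHeader 0 handles one-byte character codes.
  const subHeader = subHeaders[subHeaderKeys[highByte] >> 3];
  const { firstCode, entryCount, idDelta, idRangeStart } = subHeader;
  if (lowByte < firstCode || lowByte >= firstCode + entryCount) {
    return 0; // the missing-glyph id
  }
  let glyphId = glyphIndexArray[idRangeStart + (lowByte - firstCode)];
  if (glyphId !== 0) {
    glyphId = (glyphId + idDelta) & 0xffff;
  }
  return glyphId;
}
```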
This is unfortunately *yet another* bug in the `preEvaluateFont`-implementation, and I've lost count of the number of times I've had to tweak this code over the years :-(
I really cannot help thinking that PR 4423 was way too simplistic, since it missed a bunch of cases that leads to broken font rendering in many PDF documents.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1734802
- the exact sentence from the spec:
"The token SOLIDUS (a slash followed by no regular characters) introduces a unique valid name defined by the empty sequence of characters."
- so just remove the warning.
Note how both the annotationLayer and the document outline will apply various URL-related options when creating the link-elements.
For consistency the `xfaLayer`-rendering should obviously use the same options, to ensure that the existing options are indeed applied to all URLs regardless of where they originate.
- it aims to fix #12721.
- Thanks to PR #14023, we now have the fieldObjects in the annotation layer, so we can easily map field names to their ids if needed.
- Reset values in the storage, in the JS sandbox and in the visible html elements.
In the referenced bug, the embedded fonts contain custom CMap-data that only include strings. Note how for embedded composite TrueType fonts we're using the CMap-data when building the glyph mapping, and currently we end up with a completely empty map because the code expects only CID *numbers*.
Furthermore, just fixing the glyph mapping alone isn't sufficient to fully address the bug, since we also need to consider this "special" kind of CMap-data when looking up glyph widths.
Having recently worked with, and reviewed patches touching, this code, it seemed that it's probably not a bad idea to move that functionality into `createValidAbsoluteUrl` as new options instead.
For the `addDefaultProtocolToUrl` functionality in particular, the existing helper function was not only moved but slightly improved as well. Looking at the code, I realized that there's a small risk that it would incorrectly match a *relative* URL-string too.
With these changes, the `createValidAbsoluteUrl` call-sites in the `src/core/`-code can be simplified a little bit.
*Please note:* This patch may, indirectly, change the format of the `unsafeUrl`-property returned with relevant Annotations and OutlineItems; hence the `api-minor` tag.
However, I'd argue that it's actually more correct this way since the whole purpose of `unsafeUrl` is/was to return the URL data as-is without any parsing done.
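The new call-sites then look roughly as follows (treat the exact option names as illustrative of this patch, rather than authoritative):
```
import { createValidAbsoluteUrl } from "pdfjs-dist";

const url = createValidAbsoluteUrl("www.example.com", /* baseUrl = */ null, {
  addDefaultProtocol: true, // prepend "http://" to strings starting with "www."
  tryConvertEncoding: true, // attempt to repair mis-encoded URL-strings
});
console.log(url ? url.href : "invalid URL"); // "http://www.example.com/"
```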
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1716758;
- some buttons have a JS action with the pattern `app.launchURL(...)` (or similar), so when possible we extract the url and generate an <a> element with its href equal to the found url;
- pdf.js already had some code to handle that, so this patch slightly refactors it.
- it aims to fix #14021;
- the N dict is empty here so just create a default one;
- it implies that the checked checkbox has no appearance, so create a default one too in order to print it;
- in the pdf in the issue, a checked box is not printed because it has no default appearance, so we need to guess its appearance from its state.
In order to implement this, we utilize the existing `bidi` function to infer the text-direction of /T and /Contents entries. While this may not be perfect in cases where one PopupAnnotation mixes LTR and RTL languages, it should work well enough in most cases.
To avoid having to add *two new* properties in lots of annotations, supplementing the existing `title`/`contents`-properties, this patch instead re-factors the existing code such that the properties are replaced by Objects (containing `str` and `dir`).
*Please note:* In order to avoid breaking existing third-party implementations, `GENERIC`-builds of the PDF.js library will still provide the old `title`/`contents`-properties on annotations returned by `PDFPageProxy.getAnnotations`.
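A hedged usage example, assuming the Object-valued properties are exposed as `titleObj`/`contentsObj` (illustrative names):
```
async function applyPopupTitle(page, titleElement) {
  const annotations = await page.getAnnotations();
  for (const annotation of annotations) {
    if (annotation.titleObj) {
      const { str, dir } = annotation.titleObj; // dir is e.g. "ltr" or "rtl"
      titleElement.textContent = str;
      titleElement.dir = dir; // lets the browser lay out RTL text correctly
    }
  }
}
```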
- In the pdf in issue #14071, some select fields don't contain any values;
- the corresponding node has bindItems and bind elements, and the _bindItems function was just not called.
*This is similar to the "isSymbolicFont"-property, which is no longer exported by default after PR 11777.*
Both "isMonospace" and "isSerifFont" are internal properties, used during font parsing and building of the glyph mapping on the worker-thread.
However, both of these properties are completely unused on the main-thread and/or in the API, and accessing them will now require setting the `fontExtraProperties`-option when calling `getDocument`.
With this patch we'll ensure that only valid absolute URLs can be used in XFA documents, similar to the existing validation done for "regular" PDF documents.
Furthermore, we'll also attempt to add a default protocol (i.e. `http`) to URLs beginning with "www." in XFA documents as well; this on its own is enough to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1731240
Currently we only exclude /Encoding entries that also contain a /Differences array, which is the cause of the text-selection problem in the referenced issue.
In order to address this we'll now also exclude /Encoding entries that contain one of the predefined *named* encodings, and no longer require that it also contains a /Differences array.
*Please note:* This patch causes a small "regression" in the `bug1130815-text` test-case, however this is actually an improvement when compared with Adobe Reader and PDFium (in Google Chrome).
This reduces some unnecessary duplication, since we currently have essentially the same code in a handful of places in the `Font.fallbackToSystemFont`-method.
In this particular case the `CMap`-data that we create contains only numbers, but no strings, which causes `PartialEvaluator.readToUnicode` to create a ToUnicode-map with only empty strings.
*Please note:* This is yet another case where I don't know if it's necessarily the best and most correct solution, but it does fix the referenced issue.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1722036.
- AS and V should share the same value for a checkbox: at least that's what the specs say;
- the pdf in the above bug opens correctly in Acrobat so it likely means that AS is chosen over V.
*Please note:* All of this feels very handwavy, but at least it passes all tests locally. Hopefully we have enough tests for this part of the font code.
For non-embedded composite standard fonts with an "incomplete" /CIDToGIDMap, we'll now fallback to an *explicitly defined* /ToUnicode map even when that one happens to be an /Identity-H or /Identity-V map.
The `Font.fallbackToSystemFont` method is unfortunately getting more and more special-cases, however that might be unavoidable given all the weird non-embedded fonts found in the wild :-(
*This fixes something that I noticed, having recently looked at both the `Lexer.getObj` and `writeValue` code.*
Please note that I unfortunately don't have an example of a form where saving fails without this patch. However, given its overall simplicity and that unit-tests are added, it's hopefully deemed useful to fix this potential issue pro-actively rather than waiting for a bug report.
At this point one might, and rightly so, wonder if there are actually any real-world PDF documents where a `null` value is being used?
Unfortunately the answer is *yes*, and we have a couple of examples in the test-suite (although none of those are related to forms); please see: `issue1015`, `issue2642`, `issue10402`, `issue12823`, `issue13823`, and `pr12564`.
While some of the output looks worse to my eye, this behavior more
closely matches what I see when I open the PDFs in Adobe Acrobat.
Fixes: #4706, #9713, #8245, #1344
The `MessageHandler`-implementation already handles either of these callbacks being undefined, hence there's no particular reason (as far as I can tell) to add no-op functions here.
Also, in a couple of `MessageHandler`-methods, utilize an already existing local variable more.
Following the STR in the issue, this patch reduces the number of `PartialEvaluator.getTextContent`-related `postMessage`-calls by approximately 78 percent.[1]
Note that by enforcing a relatively low value when batching text chunks, we should thus improve worst-case scenarios while not negatively affecting `textLayer` building in general.
While working on these changes I noticed, thanks to our unit-tests, that the implementation of the `appendEOL` function unfortunately means that the number and content of the textItems could actually be affected by the particular chunking used.
That seems *extremely* unfortunate, since in practice this means that the particular chunking used is thus observable through the API. Obviously that should be a completely internal implementation detail, which is why this patch also modifies `appendEOL` to mitigate that.[2]
Given that this patch adds a *minimum* batch size in `enqueueChunk`, there's obviously nothing preventing it from becoming a lot larger than the limit (depending e.g. on the PDF structure and the CPU load/speed).
While sending more text chunks at once isn't an issue in itself, it could become problematic on the main-thread during `textLayer` building. Note how both the `PartialEvaluator` and `CanvasGraphics` implementations utilize `Date.now()`-checks, to prevent long-running parsing/rendering from "hanging" the respective thread. In the `textLayer` building we don't utilize such a construction[3], and streaming of textContent is thus essentially acting as a *simple* stand-in for that functionality.
Hence we want to avoid choosing a too large minimum batch size, since that could indirectly affect main-thread performance negatively.
---
[1] While it'd be possible to go even lower, that'd likely require more invasive re-factoring/changes to the `PartialEvaluator.getTextContent`-code to ensure that the batches don't become too large.
[2] This should also, as far as I can tell, explain some of the regressions observed in the "enhance" text-selection tests back in PR 13257.
Looking closer at the `appendEOL` function it should potentially be changed even more, however that should probably not be done here.
[3] I'd really like to avoid implementing something like that for the `textLayer` building as well, given that it'd require adding a fair bit of complexity.
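For reference, the batching described above boils down to something like this (a simplified sketch; the constant and names are illustrative):
```
const TEXT_CHUNK_BATCH_SIZE = 10; // *minimum* number of textItems per message

function enqueueChunk(sink, textContent, force = false) {
  const length = textContent.items.length;
  if (length === 0) {
    return;
  }
  // Accumulate until the minimum batch size is reached, unless forced
  // (e.g. at the end of parsing), to avoid one postMessage per text chunk.
  if (length < TEXT_CHUNK_BATCH_SIZE && !force) {
    return;
  }
  sink.enqueue(textContent, length);
  textContent.items = [];
}
```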
In the PDF document some of the glyphs have bogus `differences`-entries[1] that cannot be resolved to valid glyph names, thus causing the glyph mapping to fail.
My initial idea was to use a similar approach as in the `PartialEvaluator._simpleFontToUnicode`-method, to extract the charCodes from those entries, however it turned out that that didn't actually help in this case (the mapping was still wrong).
To fix this I'm thus proposing that we fallback to the /ToUnicode map when no other useable data exists (e.g. no post-table), since it *hopefully* shouldn't make things any worse than leaving parts of the glyph map empty (which currently happens).
---
[1] As can be seen below, some of the entries are completely normal while others are non-standard:
```
Differences (array)
0 = 65
1 = /g5167
2 = /space
3 = /g11927
4 = /g17737
5 = /g11540
6 = /g2180
7 = /K
8 = /P
9 = /two
10 = /zero
11 = /one
12 = /five
13 = /four
14 = /g6932
15 = /g7246
16 = /g1691
17 = /g2343
18 = /g14792
19 = /g3325
20 = /g4280
21 = /g20383
22 = /g18166
23 = /g16988
24 = /g17943
25 = /g19223
26 = /g10830
27 = 97
28 = /g982
29 = /g1226
30 = /g5059
31 = /g2677
32 = /g1042
33 = /g11568
34 = /L
35 = /three
36 = /seven
37 = /g2364
38 = /g12063
39 = /g5356
40 = /g2173
41 = /g17877
42 = /g7273
43 = /g7647
44 = /g7224
45 = /g19327
46 = /g5054
47 = /g2342
48 = /g10136
49 = /g6856
50 = /g13381
51 = /g7257
52 = /g12093
53 = /g2359
```
- aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1720179
- in some pdfs the XFA array in the AcroForm dictionary doesn't contain an entry for 'datasets' (which contains saved data), so basically this patch allows overwriting the AcroForm dictionary with an updated XFA array when doing an incremental update.
- when some named nodes in the template don't have their counterpart in datasets, we create some nodes: the main node mustn't belong to the datasets namespace, because that doesn't make sense and Acrobat Reader isn't able to read pdfs with such nodes.
- so created nodes under a datasets node have a namespaceId set to -1 and consequently when serialized no namespace prefix will appear.
There's a fair number of regular expressions throughout the code-base which are slightly more verbose than strictly necessary, in particular:
- We have a lot of regular expressions that use `[0-9]` explicitly, and those can be simplified to use `\d` instead.
- We have one instance of a regular expression containing a `A-Za-z0-9_` sequence, which can be simplified to use `\w` instead.
Despite its name, the fonts in the ItcSymbol-family are "regular" fonts and not Symbol ones. However, given that the font name contains the word "Symbol", we ended up picking the wrong code-path in the `Font.fallbackToSystemFont`-method.
*Please note:* While this patch ensures that the text becomes readable, by falling back to a standard font, the rendering will obviously not be perfect. However, that's the PDF generator's "fault", since non-embedded fonts cannot be guaranteed to render correctly in all environments.
Testing:
- delete the pdf file while the initial request is inflight
- delete the pdf file after the initial request has finished
Repeat for a small file and large file, exercising both one-off and
chunked transports.
Something that I *just* realized is that while PR 13854 fixed an issue as reported, it could still cause bugs in other similarly broken documents, since we'll not insert a matching endMarkedContent-operator in the operatorList.
Currently, in the `PartialEvaluator`, we only support Optional Content in Form-/XObjects. Hence this patch adds support for Image-/XObjects as well, which looks like a simple oversight in PR 12095 since the canvas-implementation already contains the necessary code to support this.
*This is a follow-up to PRs 13867 and 13899.*
This patch is tagged `api-minor` for the following reasons:
- It replaces the `renderInteractiveForms`/`includeAnnotationStorage`-options, in the `PDFPageProxy.render`-method, with the single `annotationMode`-option that controls which annotations are being rendered and how. Note that the old options were mutually exclusive, and setting both to `true` would result in undefined behaviour.
- For improved consistency in the API, the `annotationMode`-option will also work together with the `PDFPageProxy.getOperatorList`-method.
- It's now also possible to disable *all* annotation rendering in both the API and the Viewer, since the other changes meant that this could now be supported with a single added line on the worker-thread[1]; fixes 7282.
---
[1] Please note that in order to simplify the overall implementation, we'll purposely only support disabling of *all* annotations and that the option is being shared between the API and the Viewer. For any more "specialized" use-cases, where e.g. only some annotation-types are being rendered and/or the API and Viewer render different sets of annotations, that'll have to be handled in third-party implementations/forks of the PDF.js code-base.
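A hedged usage example of the new option (the URL, canvas context and viewport are placeholders):
```
import { AnnotationMode, getDocument } from "pdfjs-dist";

async function renderWithoutAnnotations(url, canvasContext, viewport) {
  const pdf = await getDocument(url).promise;
  const page = await pdf.getPage(1);
  // AnnotationMode.DISABLE skips *all* annotation rendering; other values
  // (e.g. ENABLE_FORMS) replace the old mutually exclusive options.
  await page.render({
    canvasContext,
    viewport,
    annotationMode: AnnotationMode.DISABLE,
  }).promise;
}
```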
This way there cannot be any *incorrect* cache hits, since Refs are guaranteed to be unique.
Please note that the reason for caching by Ref rather than doing something along the lines of the `localShadingPatternCache` (which uses a `Map` directly), is that TilingPatterns are streams and those cannot be cached on the `XRef`-instance (this way we avoid unnecessary parsing).
*This patch is very similar to the recently fixed `renderInteractiveForms`-options, see PR 13867.*
As far as I can tell, this *subtle* bug has existed ever since `AnnotationStorage`-support was first added in PR 12106 (a little over a year ago).
The value of the `includeAnnotationStorage`-option, as passed to the `PDFPageProxy.render` method, will (potentially) affect the size/content of the operatorList that's returned from the worker (for documents with forms).
Given that operatorLists will generally, unless they contain huge images, be cached in the API, repeated `PDFPageProxy.render` calls where the form-data has been changed by the user in between, can thus *wrongly* return a cached operatorList.
In the viewer we're only using the `includeAnnotationStorage`-option when printing, which is probably why this has gone unnoticed for so long. Note that we, for performance reasons, don't cache printing-operatorLists in the API.
However, there's nothing stopping an API-user from using the `includeAnnotationStorage`-option during "normal" rendering, which could thus result in *subtle* (and difficult to understand) rendering bugs.
In order to handle this, we need to know if the `AnnotationStorage`-instance has been updated since the last `PDFPageProxy.render` call. The most "correct" solution would obviously be to create a hash of the `AnnotationStorage` contents, however that would require adding a bunch of code, complexity, and runtime overhead.
Given that operatorList caching in the API doesn't have to be perfect[1], but only has to avoid *false* cache-hits, we can simplify things significantly by only keeping track of the last time that the `AnnotationStorage`-data was modified.
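A rough sketch of the time-stamp approach (names are illustrative):
```
class AnnotationStorage {
  constructor() {
    this._storage = new Map();
    this._modified = false;
    this._lastModified = "0";
  }

  setValue(key, value) {
    if (this._storage.get(key) !== value) {
      this._storage.set(key, value);
      this._modified = true;
    }
  }

  // A cheap stand-in for a content hash: good enough to rule out *false*
  // cache-hits when deciding whether a cached operatorList can be re-used.
  get lastModified() {
    if (this._modified) {
      this._lastModified = Date.now().toString();
      this._modified = false;
    }
    return this._lastModified;
  }
}
```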
*Please note:* While working on this patch, I also noticed that the `renderInteractiveForms`- and `includeAnnotationStorage`-options in the `PDFPageProxy.render` method are mutually exclusive.[2]
Given that the various Annotation-related options in `PDFPageProxy.render` have been added at different times, this has unfortunately led to the current "messy" situation.[3]
---
[1] Note how we're already not caching operatorLists for pages with *huge* images, in order to save memory, hence there's no guarantee that operatorLists will always be cached.
[2] Setting both to `true` will result in undefined behaviour, since trying to insert `AnnotationStorage`-values into fields that are being excluded from the operatorList-building will obviously not work, which isn't at all clear from the documentation.
[3] My intention is to try and fix this in a follow-up PR, and I've got a WIP patch locally, however it will result in a number of API-observable changes.
By not adding any additional non-`Dict` entries to the list of candidates for merging of sub-dictionaries, we can very slightly reduce the amount of parsing required by not having to *again* iterate through unmergeable data.
Once we're finally able to get rid of SystemJS, which is unfortunately still blocked on [bug 1247687](https://bugzilla.mozilla.org/show_bug.cgi?id=1247687), we might also want to clean-up (or even completely remove) the `BaseException` abstraction and simply extend `Error` directly instead.
At that point we'd need to (explicitly) set the `name` on each class anyway, so this patch is essentially preparing for future clean-up. Furthermore, after the `BaseException` abstraction was added there's been *multiple* issues filed about third-party minification breaking our code since `this.constructor.name` is not guaranteed to always do what you intended.
While hard-coding the strings indeed feels quite unfortunate, it's likely the "best" solution to avoid the problem described above.
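Once extending `Error` directly becomes possible, the pattern would look roughly like this (a sketch; the current `BaseException` cannot extend `Error` yet, as noted above):
```
class BaseException extends Error {
  constructor(message, name) {
    super(message);
    // Hard-coded by each subclass; `this.constructor.name` is not reliable
    // once third-party minification renames the classes.
    this.name = name;
  }
}

class InvalidPDFException extends BaseException {
  constructor(message) {
    super(message, "InvalidPDFException");
  }
}
```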
The value of the `renderInteractiveForms` parameter, as passed to the `PDFPageProxy.render` method, will (potentially) affect the size/content of the operatorList that's returned from the worker (for documents with forms).
Given that operatorLists will generally, unless they contain huge images, be cached in the API, repeated `PDFPageProxy.render` calls that *only* change the `renderInteractiveForms` parameter can thus return an incorrect operatorList.
As far as I can tell, this *subtle* bug has existed ever since `renderInteractiveForms`-support was first added in PR 7633 (which is almost five years ago).
With the previous patch, fixing this is now really simple by "encoding" the `renderInteractiveForms` parameter in the *internal* renderingIntent handling.
With the changes made in PR 13746 the *internal* renderingIntent handling became somewhat "messy", since we're now having to do string-matching in various spots in order to handle the "oplist"-intent correctly.
Hence this patch, which implements the idea from PR 13746 to convert the `intent`-strings, used in various API-methods, into an *internal* renderingIntent that's implemented using a bit-field instead. *Please note:* This part of the patch, in itself, does *not* change the public API (but see below).
This patch is tagged `api-minor` for the following reasons:
1. It changes the *default* value for the `intent` parameter, in the `PDFPageProxy.getAnnotations` method, to "display" in order to be consistent across the API.
2. In order to get *all* annotations, with the `PDFPageProxy.getAnnotations` method, you now need to explicitly set "any" as the `intent` parameter.
3. The `PDFPageProxy.getOperatorList` method will now also support the new "any" intent, to allow accessing the operatorList of all annotations (limited to those types that have one).
4. Finally, for consistency across the API, the `PDFPageProxy.render` method also supports the new "any" intent (although I'm not sure how useful that'll be).
Points 1 and 2 above are the significant, and thus breaking, changes in *default* behaviour here. However, unfortunately I cannot see a good way to improve the overall API while also keeping `PDFPageProxy.getAnnotations` unchanged.
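In outline, the bit-field approach looks something like this (flag names and values are illustrative, and only a subset is shown):
```
const RenderingIntentFlag = {
  ANY: 0x01,
  DISPLAY: 0x02,
  PRINT: 0x04,
  OPLIST: 0x100,
};

function toRenderingIntent(intent) {
  switch (intent) {
    case "any":
      return RenderingIntentFlag.ANY;
    case "print":
      return RenderingIntentFlag.PRINT;
    case "display":
    default:
      return RenderingIntentFlag.DISPLAY;
  }
}

// String-matching such as checking for an "oplist"-prefix then becomes a
// simple bit-mask test:
//   if (renderingIntent & RenderingIntentFlag.OPLIST) { ... }
```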
Given how trivial the `isEOF` function is, we can simply inline the check at the various call-sites and remove the function (which ought to be ever so slightly more efficient as well).
Furthermore, this patch also changes the `EOF` primitive itself to a `Symbol` instead of an Object since that has the nice benefit of making it unclonable (thus preventing *accidentally* trying to send `EOF` from the worker-thread).
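A minimal sketch of the `Symbol`-based sentinel, with the check inlined at a call-site (the `lexer` is hypothetical):
```
const EOF = Symbol("EOF");

function parseAll(lexer) {
  const objects = [];
  let obj;
  while ((obj = lexer.getObj()) !== EOF) {
    objects.push(obj);
  }
  // Structured cloning throws on Symbols, so *accidentally* sending EOF
  // from the worker-thread now fails loudly instead of silently.
  return objects;
}
```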
Given that the GitHub Advanced Security workflow now covers everything that LGTM does, but generally faster and with better GitHub-integration, there's no longer much point in also running LGTM separately.
As a follow-up to this patch, we should also disable/remove the LGTM-integration from the PDF.js repository.
The PDF in bug 1721949 uses many unique pattern objects
that reference the same shading many times. This caused
a new canvas pattern to be created and cached many times,
driving up memory use.
To fix this, I've changed the cache in the worker to key off the
shading object and instead send the shading and matrix
separately. While that worked well to fix the above bug,
there could be PDFs that use many shadings and could
cause memory issues, so I've also added an LRU cache
on the main thread for canvas patterns. This should prevent
memory use from getting too high.
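A minimal sketch of such a main-thread LRU cache (illustrative; the real cache keys off the shading objects):
```
class LRUCache {
  constructor(maxSize = 10) {
    this._cache = new Map(); // Maps iterate in insertion order.
    this._maxSize = maxSize;
  }

  get(key) {
    const value = this._cache.get(key);
    if (value !== undefined) {
      // Re-insert so this key becomes the most recently used one.
      this._cache.delete(key);
      this._cache.set(key, value);
    }
    return value;
  }

  set(key, value) {
    this._cache.delete(key);
    if (this._cache.size >= this._maxSize) {
      // Evict the least recently used entry, i.e. the first key in the Map.
      this._cache.delete(this._cache.keys().next().value);
    }
    this._cache.set(key, value);
  }
}
```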
- All the scale factors for the substitution font were wrong because of different glyph positions between Liberation and the other ones:
- regenerate all the factors
- Text may contain Polish chars for example, and in this case the glyph widths were wrong:
- treat the substitution font as a composite one
- add a glyphIndex-to-unicode map for Liberation in order to generate the width array for the cid font
- In order to better compute text field sizes, use the line height with no gaps (and consequently guessed heights for text are slightly better in general).
- Fix default background color in fields.
- an Image element was created and attached to its parent, but the $globalData property was not set, which led to an error.
- the pdf in bug 1722029 has 27 rendered rows (checked in Acrobat) when only one was displayed: this patch fixes some binding issues around the occur element.
This patch makes use of the existing `ignoreErrors` option, thus allowing a page to continue parsing/rendering even if (some of) its sub-streams are corrupt. Obviously this may cause *part* of a page to be broken/missing, however it should be better than (potentially) rendering nothing.
Also, to the best of my knowledge, this is the first bug of its kind that we've encountered.
To avoid having to pass in a bunch of, for a `BaseStream`-instance, mostly unrelated parameters when initializing a `StreamsSequenceStream`-instance, I settled on utilizing a callback function instead to allow conditional Error-suppression.
Note that the `StreamsSequenceStream`-class is a *special* stream-implementation that we only use when the `/Contents`-entry, in the `/Page`-dictionary, consists of an Array with streams.
Apparently I didn't put one of the disable comments on the *correct* line, since I didn't read the instructions carefully enough, so let's try again.
Note that, most unfortunately, disabling of warnings isn't applied until *after* a patch has been merged.
Most of the warnings we don't really care about, and those are simply white-listed using inline comments; however two cases prompted actual code changes:
- In `src/display/pattern_helper.js` the branch in question is indeed unreachable, and should thus be safe to remove. (This code originated in PR 4192, which is now over seven years ago.)
- In `test/test.js`, the function in question indeed doesn't accept any arguments. (The patch also re-formats a string just above, which didn't seem worthy of a separate patch.)
This now leaves only *one* warning in the LGTM report, however that one is a false positive that we'll need to report upstream.
In cases where even the very *first* attempt at reading from an object will throw, simply ignoring such objects will help improve rendering of *some* corrupt documents.
Note that this will lead to more parsing in some cases, but considering that this only applies to *corrupt* documents that shouldn't be a big deal.
Bug 1721218 has a shading pattern that was used thousands of times.
To improve performance of this PDF:
- add a cache for patterns in the evaluator and only send the IR form once
to the main thread (this also makes caching in canvas easier)
- cache the created canvas radial/axial patterns
- for radial/axial shading fills, use the pattern directly instead of creating a
temporary canvas
Currently the `!mergeSubDicts` code-path is essentially just duplicated code, which we can easily avoid by simply moving that check. (This may lead to ever so slightly more parsing for this case, but the difference ought to be negligible in practice.)
- in real life some xfa documents contain xml like <xfa:data><xfa:Foo><xfa:Bar>...</xfa:data>;
since there are no Foo or Bar elements in the xfa namespace, the JS representations are empty
and that leads to errors.
- so the idea is to make all nodes under the xfa:data node namespace-agnostic, which means
that namespaces are removed from nodes in the parser, but only for xfa:data descendants.
*Please note:* The PDF document in issue 13751 is *dynamically* created (in e.g. Adobe Reader), with pages added when certain buttons are clicked, hence this patch simply fixes the breaking error and nothing more.
It looks like the current code contains a little bit too much copy-and-paste from the *similar* `index` branch above, since we cannot set the `startIndex` to a negative value. Note how it's being used to initialize the loop-variable, which is then used to lookup values in an Array and accessing the `-1`th element of an Array obviously makes no sense.
*This is yet another case where I've got no idea if the patch is correct, but it does at least fix a breaking error :-)*
Note how in the [`Binder._bindValue` method](683ce66a48/src/core/xfa/bind.js (L92-L93)), we're assuming that if a `data`-value exists then it'll also be possible to actually access it. For the `XFAAttribute`-implementation however, the second method is missing and that's what causes the breaking errors in issue 13748.
Please note that another possible way of "fixing" the error would've been to simply change the exists-check to return `false`, and I could see that being a preferred solution.
However, the reason for submitting the current patch is that we get *fewer* warnings about Nodes with mis-matched types this way.
As can be seen in the code (see below), the `searchNode` helper function will return `null` in some cases and all of its call-sites should protect against that before attempting to access the returned data.
While only one of these changes were necessary to fix the breaking errors in issue 13756, in order to prevent future bugs I've added similar defensive code throughout this file.
- 07955fa1d3/src/core/xfa/som.js (L169)
- 07955fa1d3/src/core/xfa/som.js (L239)
- 07955fa1d3/src/core/xfa/som.js (L254)
For e.g. `gulp mozcentral`, the *built* `pdf.worker.js` file decreases from `1 837 608` to `1 834 907` bytes with this patch-series.
The improvement comes first of all from less overall indentation in `PDFFunction`, and secondly from the removal of (now) unnecessary indirection in the code.
*This follows the exact same principle as PR 12083, but for the `PDFFunction` parsing instead.*
Given that the IR format is completely unused now, all that the current code does is add a bunch of unnecessary indirection/overhead to the handling of PDF-functions.
With this patch, the `PDFPageProxy.getOperatorList` method will now return `PDFOperatorList`-instances that also include Annotation-operatorLists (when those exist). Hence this closes a small, but potentially confusing, gap between the `render` and `getOperatorList` methods.
Previously we've been somewhat reluctant to do this, as explained below, but given that there's actual use-cases where it's required probably means that we'll *have* to implement it now.
Since we still need the ability to separate "normal" rendering operations from direct `getOperatorList` calls in the worker-thread, this API-change unfortunately causes the *internal* renderingIntent to become a bit "messy" which is indeed unfortunate (note the `"oplist-"` strings in various spots). As-is I suppose that it's not all that bad, but we may want to consider changing the *internal* renderingIntent to e.g. a bitfield in the future.
Besides fixing issue 13704, this patch would also be necessary if someone ever tries to implement e.g. issue 10165 (since currently `PDFPageProxy.getOperatorList` doesn't include Annotation-operatorLists).
*Please note:* This patch is *also* tagged "api-minor" for a second reason, which is that we're now including the Annotation-id in the `beginAnnotation` argument. The reason for this is to allow correlating the Annotation-data returned by `PDFPageProxy.getAnnotations`, with its corresponding operatorList-data (for those Annotations that have it).
When working on PR 13612, I mostly prioritized a simple solution that didn't require touching a lot of code. However, while working on PR 13735 I started to realize that the static `Name.empty` construction really wasn't a good idea.
In particular, having a special `Name`-instance where the `name`-property isn't actually a String is confusing (to put it mildly) and can easily lead to issues elsewhere. The only reason for not simply allowing the `name`-property to be an *empty* string, in PR 13612, was to avoid having to touch a lot of existing code. However, it turns out that this is only limited to a few methods in the `PartialEvaluator` and a few of the `BaseLocalCache`-implementations, all of which can be easily re-factored to handle *empty* `Name`-instances.
All-in-all, I think that this patch is even an *overall* improvement since we're now validating (what should always be) `Name`-data better in the `PartialEvaluator`.
This is what I ought to have done from the start, sorry about the code churn here!
This way we ensure that these BBox values are *always* defined as expected for every `case`-block, and we also don't need to duplicate the lookup in multiple places. (Also, the patch removes a couple of unnecessary line-breaks in existing comments.)
Fixes https://github.com/mozilla/pdf.js/pull/13691#pullrequestreview-702356627, which was flagged by LGTM.
- font line height is taken into account by Acrobat when it isn't by masterpdfeditor: I extracted a font from a pdf, modified some ascent/descent properties thanks to ttx and then reinjected the font into the pdf: only Acrobat takes it into account. So in this patch, line heights for some substituted fonts are added.
- it seems that Acrobat is using a line height of 1.2 when the line height in the font is not enough (it's the only way I found to correctly fix bug 1718741).
- don't use flex in the wrapper container (which was causing a horizontal overflow in the above bug).
- consequently, the above fixes introduced a lot of small regressions, so in order to see real improvements on reftests, I fixed the regressions in this patch:
- replace margin with padding in some cases where the padding is part of a container's dimensions;
- remove some flex display: some containers are wrongly sized when rendered;
- set letter-spacing to 0.01px: it helps to be sure that text is not broken because of not enough width in Firefox.
With the changes in PR 13687 we're now checking if `target` is defined *twice* in a row, which shouldn't be necessary :-)
(I noticed this when glancing at the unofficial LGTM results; maybe we should re-evaluate the decision to not integrate that into the CI.)