Sakurai/pdf.js - pdf.js - Gitea on kemo

Author	SHA1	Message	Date
Samuel Yuan	36fb5c1e2b	Propagate the translated font name to TextContentItems. This allows font data for system fonts to be looked up in the PDFObjects.	2022-11-08 11:16:21 -08:00
Jonas Jenwald	23930a249e	[api-minor] Let `Catalog.getAllPageDicts` return an empty dictionary when loading the first /Page fails (issue 15590) In order to support opening certain corrupt PDF documents, particularly hand-edited ones, this patch adds support for letting the `Catalog.getAllPageDicts` method fallback to returning an empty dictionary to replace (only) the first /Page of the document. Given that the viewer cannot initialize/load without access to the first page, this will thus allow e.g. document-level scripting to run as expected. Note that by effectively replacing a corrupt or missing first /Page in this way[1], we'll now render nothing but a blank page for certain cases of broken/corrupt PDF documents which may look weird. Please note: This functionality is controlled via the existing `stopAtErrors` option, that can be passed to `getDocument`, since it's easy to imagine use-cases where this sort of fallback behaviour isn't desirable. --- [1] Currently we still require that a /Pages-dictionary is found though, however it may be possible to relax even that assumption if that becomes absolutely necessary in future corrupt documents.	2022-11-03 12:51:48 +01:00
Jonas Jenwald	2516ffa78e	Fallback to finding the first "obj" occurrence, when the trailer-dictionary is incomplete (issue 15590) Note that the "trailer"-case is already a fallback, since normally we're able to use the "xref"-operator even in corrupt documents. However, when a "trailer"-operator is found we still expect "startxref" to exist and be usable in order to advance the stream position. When that's not the case, as happens in the referenced issue, we use a simple fallback to find the first "obj" occurrence instead. This partially fixes issue 15590, since without this patch we fail to find any objects at all during `XRef.indexObjects`. However, note that the PDF document is still corrupt and won't render since there's no actual /Pages-dictionary and the /Root-entry simply points to the /OpenAction-dictionary instead.	2022-11-03 12:46:30 +01:00
Jonas Jenwald	ce66fefbff	[api-minor] Add partial support for the "GoToE" action (issue 8844) Please note: The referenced issue is the only mention that I can find, in either GitHub or Bugzilla, of "GoToE" actions. Hence why I've purposely settled for a very simple, and partial, "GoToE" implementation to avoid complicating things initially.[1] In particular, this patch only supports "GoToE" actions that reference the /EmbeddedFiles-dict in the PDF document. See https://web.archive.org/web/20220309040754if_/https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G11.2048909 --- [1] I usually prefer having real-world test-cases to work with, whenever I'm implementing new features.	2022-10-06 10:33:07 +02:00
Jonas Jenwald	c87f90102c	Add more non-standard ligatures in the `glyphlist.js` file (issue 15516) Note that this PR only adds the "underscore"-variant of actually existing ligatures, however the referenced PDF document also uses a couple of non-standard ones (e.g. `ft`, `Th`, and `fh`) that we cannot easily support without larger changes (since they don't have official Unicode-entries). Given that it's clearly the PDF document, and its fonts, that's the culprit here it's not entirely clear to me that we actually want to attempt a larger refactoring/rewriting of the `glyphlist.js` code, assuming it's even generally possible. Especially when this patch alone already improves our copy-paste behaviour when compared to both Adobe Reader and PDFium, and that this is only the second time this sort of bug has been reported.	2022-09-27 16:31:51 +02:00
Jonas Jenwald	cc4baa2fe9	[api-minor] Add basic support for the `SetOCGState` action (issue 15372) Note that this patch implements the `SetOCGState`-handling in `PDFLinkService`, rather than as a new method in `OptionalContentConfig`[1], since this action is nothing but a series of `setVisibility`-calls and that it seems quite uncommon in real-world PDF documents. The new functionality also required some tweaks in the `PDFLayerViewer`, to ensure that the `layersView` in the sidebar is updated correctly when the optional-content visibility changes from "outside" of `PDFLayerViewer`. --- [1] We can obviously move this code into `OptionalContentConfig` instead, if deemed necessary, but for an initial implementation I figured that doing it this way might be acceptable.	2022-09-01 17:34:24 +02:00
Jonas Jenwald	216b86a082	[api-minor] Support Named-actions in the outline (issue 15367) Apparently this is implemented in e.g. Adobe Reader, and the specification does support it, however it cannot be commonly used in real-world PDF documents since it took over ten years for this feature to be requested.	2022-08-30 18:47:45 +02:00
Calixte Denizet	c06c5f7cbd	[Annotations] charLimit === 0 means unlimited (bug 1782564) Changing the charLimit in JS had no impact, so this patch aims to fix that and add an integration test for it.	2022-08-19 11:28:28 +02:00
Jonas Jenwald	dd95e4f851	Add official support for passing `ArrayBuffer`-data to `getDocument` (issue 15269) While this has always worked, as a consequence of the implementation, it's never been officially supported. In addition to adding basic unit-tests, this patch also introduces a couple of new JSDoc `@typedef`s in the API to avoid overly long lines.	2022-08-10 14:13:01 +02:00
Jonas Jenwald	0c31320c12	[api-minor] Improve `thumbnail` handling in documents that contain interactive forms To improve performance of the sidebar we use the page-canvases to generate the thumbnails whenever possible, since that avoids unnecessary re-rendering when the sidebar is open. This works generally well, however there's an old problem in PDF documents that contain interactive forms (when those are enabled): Note how the thumbnails become partially (or fully) blank, since those Annotations are not included in the OperatorList.[1] We obviously want to keep using the `PDFThumbnailView.setImage`-method for most documents, however we need a way to skip it only for those pages that contain interactive forms. As it turns out it's unfortunately not all that simple to tell, after the fact, from looking only at the OperatorList that some Annotations were skipped. While it might have been possible to try and infer that in the viewer, it'd not have been pretty considering that at the time when rendering finishes the annotationLayer has not yet been built. The overall simplest solution that I could come up with, was instead to include a summary of the interactive form-state when doing the final "flushing" of the OperatorList and expose that information in the API. --- [1] Some examples from our test-suite: `annotation-tx2.pdf` where the thumbnail is completely blank, and `bug1737260.pdf` where the thumbnail is missing the "buttons" found on the page.	2022-07-30 16:53:32 +02:00
Jonas Jenwald	2fb083f3e2	Ensure that the `isUsingOwnCanvas`-parameter is consistently included in operatorLists (PR 14247 follow-up) Currently some `OPS.beginAnnotation` arguments will contain a `Number` value for the `isUsingOwnCanvas`-parameter, or in some cases an `undefined` value, which is inconsistent from an API perspective.	2022-07-28 13:37:37 +02:00
Jonas Jenwald	c2f7942aea	Ensure that the /Resources-entry is actually a dictionary (issue 15150) Prevent issues in corrupt PDF documents, if the /Resources-entry is not of the correct and expected type.	2022-07-08 12:43:43 +02:00
Calixte Denizet	3789dab307	Always flush the current item with MarkedContent stuff when getting text (#15094)	2022-06-25 17:19:57 +02:00
Jonas Jenwald	1cc7cecc7b	[api-minor] Introduce a `PrintAnnotationStorage` with frozen serializable data Given that printing is triggered synchronously in browsers, it's thus possible for scripting (in PDF documents) to modify the Annotation-data while printing is currently ongoing. To work-around that we add a new printing-specific `AnnotationStorage`, where the serializable data is frozen upon initialization, which the viewer can thus create/utilize during printing.	2022-06-23 17:06:46 +02:00
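The frozen-snapshot idea in the `PrintAnnotationStorage` commit above can be sketched roughly as follows. This is only an illustration under stated assumptions: `SimpleStorage` and `FrozenPrintStorage` are hypothetical names, and the code does not reproduce the actual PDF.js classes.

```javascript
// Hypothetical mutable storage, standing in for a regular annotation storage.
class SimpleStorage {
  #map = new Map();

  setValue(key, value) {
    // Merge the new properties into a fresh object for this key.
    this.#map.set(key, { ...this.#map.get(key), ...value });
  }

  get serializable() {
    return this.#map.size > 0 ? new Map(this.#map) : null;
  }
}

// Hypothetical print-only storage: the data is copied and frozen once, at
// construction time, so later changes (e.g. from scripting) are not visible.
class FrozenPrintStorage {
  #serializable;

  constructor(parent) {
    const data = parent.serializable;
    this.#serializable = data
      ? new Map([...data].map(([k, v]) => [k, Object.freeze({ ...v })]))
      : null;
  }

  get serializable() {
    return this.#serializable;
  }

  setValue() {
    throw new Error("The print storage is read-only.");
  }
}
```

A viewer following this pattern would create the frozen copy when printing starts and hand it to the print rendering code, while the regular storage stays writable.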
Calixte Denizet	cdc58b7a52	Rotate annotations based on the MK::R value (bug 1675139) - it aims to fix: https://bugzilla.mozilla.org/show_bug.cgi?id=1675139; - An annotation can be rotated (counterclockwise); - the rotation can be set in using JS.	2022-06-21 17:57:26 +02:00
Jonas Jenwald	bbf857d635	[api-minor] Stop using the `beginAnnotations`/`endAnnotations` operators (PR 14998 follow-up) After the changes in PR 14998, these operators are now no-ops in the `src/display/canvas.js` code and should no longer be necessary. Given that `beginAnnotations`/`endAnnotations` are not in the PDF specification, but are rather custom PDF.js operators, it seems reasonable to stop using them now that they've become no-ops.	2022-06-11 14:21:26 +02:00
Jonas Jenwald	8135d7ccf6	Merge pull request #14869 from calixteman/14862 [JS] Fix few bugs present in the pdf for issue #14862	2022-05-03 18:31:31 +02:00
Calixte Denizet	094ff38da0	[JS] Fix few bugs present in the pdf for issue #14862 - since the resetForm function resets a field value, a calculateNow is consequently triggered. But the calculate callback can itself call resetForm, hence an infinite recursive loop. So basically, prevent calculateNow from being triggered by itself. - in Firefox, the letters entered in some fields were duplicated: "AaBb" instead of "AB". It was mainly because beforeInput was triggering a Keystroke which was itself triggering an input value update and then the input event was triggered. So in order to avoid that, beforeInput calls preventDefault and then it's up to the JS to handle the event. - fields have a property valueAsString which returns the value as a string. In the implementation it was wrongly used to store the formatted value of a field (2€ when the user entered 2). So this patch implements valueAsString correctly. - non-rendered fields can be updated using JS, but when they are, they must take some properties from the annotationStorage. It was implemented for field values, but it wasn't for display, colors, ... - it fixes #14862 and #14705.	2022-05-03 15:48:44 +02:00
Jonas Jenwald	df5a4fd0a7	Support encoded dest-strings in /GoTo destination dictionaries (issue 14864) Interestingly enough this appears to be the very first case of encoded dest-strings, in /GoTo destination dictionaries, that we've actually come across. What's really fascinating is that it's less than a week after issue 14847, given that these issues are somewhat similar.	2022-05-02 10:14:32 +02:00
Jonas Jenwald	71370d012b	Support destinations in NameTrees with encoded keys (issue 14847) Initially I considered updating the `NameOrNumberTree`-implementation to handle encoded keys, however that quickly became somewhat messy (especially in the `NameOrNumberTree.get`-method) since only NameTrees use string-keys. Hence the easiest solution, as far as I'm concerned, was to just update the `Catalog.destinations`-getter instead. Please note that in the referenced PDF document the `Catalog.destination`-method will thus fallback to fetch all destinations, which should be fine since this is the very first case of encoded keys that we've seen. Also changes the `NameOrNumberTree.getAll`-method to prevent a possible run-time error, although we've so far not seen such a case, for any non-Array Kids-entries found in a NameTree/NumberTree. Finally, to improve overall consistency and to hopefully prevent future bugs, the patch also updates a couple of other `NameTree` call-sites to correctly handle encoded keys. (Note that the `Catalog.attachments`-getter was already doing this.)	2022-04-27 11:19:55 +02:00
Calixte Denizet	040fcae5ab	Improve performance with image masks (bug 857031) - it aims to partially fix the performance issue reported in https://bugzilla.mozilla.org/show_bug.cgi?id=857031; - the idea is to avoid using byte arrays and instead use ImageBitmap objects, which are much faster to draw: * an ImageBitmap is Transferable which means that it can be built in the worker instead of in the main thread: - this is achieved by using an OffscreenCanvas when it's available, there is a bug to enable them for pdf.js: https://bugzilla.mozilla.org/show_bug.cgi?id=1763330; - or by using createImageBitmap: in Firefox a task is sent to the main thread to build the bitmap so it's slightly slower than using an OffscreenCanvas. * it's transferred from the worker to the main thread by "reference"; * the byte buffers used to create the image data have a very short lifetime and ergo the memory used is globally less than before. - Use the localImageCache for the mask; - Fix the pdf issue4436r.pdf: it was expected to have a binary stream for the image; - Move the singlePixel trick from operator_list to image: this way we can use this trick even if it isn't in a set as defined in operator_list.	2022-04-09 18:26:26 +02:00
Calixte Denizet	18e79e3c0b	[text selection] Add the whitespaces present in the pdf in the text chunk - it aims to fix issue #14627; - the basic idea of the recent text refactoring was to only consider the rendered visible whitespaces. But sometimes, the heuristics aren't correct and although some whitespaces are in the text stream they weren't in the text chunks because they were too small. Hence we added some exceptions, for example, we always add a whitespace when it is between two non-whitespace chars but only when in the same Tj. So basically, this patch removes the constraint to have the chars in the same Tj (in using a circular buffer to save the two last chars) but don't add a space when the visible space is really too small (hence `NOT_A_SPACE_FACTOR`).	2022-03-27 14:34:56 +02:00
Jonas Jenwald	fb345ee184	Enable the "gets fieldObjects" unit-test in Node.js (PR 14409 follow-up) Apparently this unit-test works in Node.js now, hence it's possible that the reason it didn't work previously is that there were bugs in our old `structuredClone` polyfill.	2022-03-13 10:40:57 +01:00
Jonas Jenwald	537ed37835	Move the `isSameOrigin` helper function This function is currently placed in the `src/shared/util.js` file, which means that the code is duplicated in both of the built `pdf.js` and `pdf.worker.js` files. Furthermore, it only has a single call-site which is also specific to the `GENERIC`-build of the PDF.js library. Hence this helper function is instead moved into the `src/display/api.js` file, in such a way that it's conditionally defined but still can be unit-tested.	2022-03-10 13:51:09 +01:00
Jonas Jenwald	172d007598	[api-minor] Add validation for the `PDFDocumentProxy.getPageIndex` method Currently we'll happily attempt to send any argument passed to this method over to the worker-thread, without doing any sort of validation. That could obviously be quite bad, since there's first of all no protection against sending unclonable data. Secondly, it's also possible to pass data that will cause the `Ref.get` call in the worker-thread to fail immediately. In order to address all of these issues, we'll now properly validate the argument passed to `PDFDocumentProxy.getPageIndex` and when necessary reject already on the main-thread instead.	2022-02-24 12:01:51 +01:00
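As a rough illustration of the main-thread validation described in the `getPageIndex` commit above, the sketch below rejects early instead of forwarding arbitrary data to the worker. The helper names (`isValidRef`, `sendToWorker`) and the error text are assumptions made for this example, not the PDF.js source.

```javascript
// Hypothetical check: a page reference is an object with non-negative
// integer `num` and `gen` properties.
function isValidRef(ref) {
  return (
    typeof ref === "object" &&
    ref !== null &&
    Number.isInteger(ref.num) &&
    ref.num >= 0 &&
    Number.isInteger(ref.gen) &&
    ref.gen >= 0
  );
}

// `sendToWorker` is a stand-in for whatever messaging function forwards the
// request; it is only called once the argument has been validated.
function getPageIndex(ref, sendToWorker) {
  if (!isValidRef(ref)) {
    return Promise.reject(new Error("Invalid pageIndex request."));
  }
  // Copy only the validated fields, so that nothing unclonable is sent.
  return sendToWorker({ num: ref.num, gen: ref.gen });
}
```

Copying just `num` and `gen` is one way to make sure that extra, possibly unclonable, properties on the caller's object never reach the worker.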
Jonas Jenwald	2be8036eb7	[api-minor] Reduce duplication in the "gets non-existent page" unit-test	2022-02-24 11:25:21 +01:00
Calixte Denizet	18e3a98c2b	[api-minor] Don't add in the text content the chars which are out-of-page (bug 1755201) - it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1755201; - if the glyph position is not within the view then skip it.	2022-02-13 21:07:11 +01:00
Jonas Jenwald	1f0fb270b1	[api-minor] Ensure that the `PDFDocumentLoadingTask`-promise is rejected when cancelling the PasswordPrompt (bug 1754421) This is essentially a continuation of PR 7926, where we added support for rejecting the current `PDFDocumentLoadingTask`-promise by throwing inside of the `onPassword`-callback. Hence the naive way to address [bug 1754421](https://bugzilla.mozilla.org/show_bug.cgi?id=1754421) would be to simply throw in the `onPassword`-callback used in the default viewer. However it unfortunately turns out to not work, since the password input/validation is asynchronous, and we thus need another approach. The simplest solution that I can come up with here, is thus to extend the `onPassword`-callback to also reject the current `PDFDocumentLoadingTask`-instance if an `Error` is explicitly passed as the input to the callback function. (This doesn't feel great, but I cannot see a better solution that isn't really complicated.)	2022-02-09 15:09:20 +01:00
Jonas Jenwald	403baa7bba	[api-minor] Remove the `normalizeWhitespace` option in the `PDFPageProxy.{getTextContent, streamTextContent}` methods (issue 14519, PR 14428 follow-up) With these changes, we'll now always replace all whitespaces with standard spaces (0x20). This behaviour is already, since many years, the default in both the viewer and the browser-tests.	2022-02-03 09:17:22 +01:00
Calixte Denizet	ae842e1c3a	[api-minor] Annotations - Adjust the font size in text field in considering the total width (bug 1721335) - it aims to fix #14502 and bug 1721335; - Acrobat and Pdfium do the same; - it'll avoid to have truncated data when printed; - change the factor to compute font size in using field height: lineHeight = 1.35*fontSize - this is the value used by Acrobat. - in order to not have truncated strings on the bottom, add few basic metrics for standard fonts.	2022-01-30 15:53:31 +01:00
Calixte Denizet	e1d3a3b414	Remove the invisible format marks from the text chunks - it aims to fix issue #9186.	2022-01-24 13:47:24 +01:00
Jonas Jenwald	b9849e38b8	Expose even more API-functionality in the TypeScript definitions (issue 14435, PR 14013 follow-up) While `PageViewport` apparently makes sense in TypeScript environments, given that it's being returned by the `PDFPageProxy.getViewport`-method in the API, we really don't want to extend the public API by simply exporting the class directly in `src/pdf.js` since it should never be called/initialized manually. Hence we follow the same pattern as in PR 14013, and also extend the API unit-tests to ensure that `PDFPageProxy.getViewport` always returns a `PageViewport`-instance as expected.	2022-01-13 12:05:40 +01:00
Jonas Jenwald	b99927e1ee	Improve the API unit-tests for scripting-related functionality I happened to notice that we didn't have any unit-tests for either `getFieldObjects` or `getCalculationOrderIds`, on the `PDFDocumentProxy` class, which seems unfortunate since it's API functionality that we depend on in e.g. the viewer.	2021-12-29 12:57:32 +01:00
Jonas Jenwald	b513c64d9d	[api-minor] Convert `Catalog.getPageDict` to an asynchronous method Besides converting `Catalog.getPageDict` to an `async` method, thus simplifying the code, this patch also allows us to pro-actively fix a existing issue. Note how we're looking up References in such a way that `MissingDataException`s won't cause trouble, however it's technically possible that the entries (i.e. /Count, /Kids, and /Type) in a /Pages Dictionary could actually be indirect objects as well. In the existing code this could lead to some, or even all, pages failing to load/render as intended. In practice that doesn't appear to happen in real-world PDF documents, but given all the weird things that PDF software do I'd prefer to fix this pro-actively (rather than waiting for a bug report). With `Catalog.getPageDict` being `async` this is now really simple to address, however I didn't want to introduce a bunch more unconditional asynchronicity in this method if it could be avoided (since that could slow things down). Hence we'll synchronously lookup the raw data in a /Pages Dictionary, and only fallback to asynchronous data lookup when a Reference was encountered. In addition to the above, this patch also makes the following notable changes: - Let `Catalog.getPageDict` consistently reject with the actual error, regardless of what data we're fetching. Previously we'd "swallow" the actual errors except when looking up Dictionary entries, which is inconsistent and thus seem unfortunate. As can be seen from the updated unit-tests this change is API-observable, hence why the patch is tagged `[api-minor]`. - Improve the consistency of the Dictionary /Type-checks in both the `Catalog.getPageDict` and `Catalog.getAllPageDicts` methods. In `Catalog.getPageDict` there's a fallback code-path where we're incorrectly checking the /Page Dictionary for a /Contents-entry, which is wrong since a /Page Dictionary doesn't need to have a /Contents-entry in order to be valid. 
For consistency the `Catalog.getAllPageDicts` method is also updated to handle errors in the /Type-lookup correctly. - Reduce the `PagesCountLimit.PAUSE_EAGER_PAGE_INIT` viewer constant, to further improve loading/rendering performance of the second page during initialization of very long documents; PR 14359 follow-up.	2021-12-25 15:22:48 +01:00
Jonas Jenwald	e8562173b8	Prevent an infinite loop when parsing corrupt /CCITTFaxDecode data (issue 14305) Fixes one of the documents in issue 14305.	2021-12-07 13:57:25 +01:00
Tim van der Meij	335c4c8a43	Merge pull request #14338 from Snuffleupagus/XRef-more-Pages-validation [api-minor] Clear all caches in `XRef.indexObjects`, and improve /Root dictionary validation in `XRef.parse` (issue 14303)	2021-12-04 13:23:40 +01:00
Jonas Jenwald	40291d1943	Handle errors when fetching the raw /Metadata (issue 14305) Currently the `Catalog.metadata` getter only handles errors during parsing, however in a corrupt PDF document fetching of the raw /Metadata can obviously fail as well. Without this patch the `PDFDocumentProxy.getMetadata` method, in the API, can thus fail which it never should and this will cause the viewer to not initialize all state as expected. Fixes one of the documents in issue 14305.	2021-12-04 09:41:42 +01:00
Jonas Jenwald	ad3a271fc4	[api-minor] Clear all caches in `XRef.indexObjects`, and improve /Root dictionary validation in `XRef.parse` (issue 14303) This patch improves handling of a couple of PDF documents from issue 14303. - Update `XRef.indexObjects` to actually clear all XRef-caches. Invalid XRef tables usually cause issues early enough during parsing that we've not populated the XRef-cache, however to prevent any issues we obviously need to clear that one as well. - Improve the /Root dictionary validation in `XRef.parse` (PR 9827 follow-up). In addition to checking that a /Pages entry exists, we'll now also check that it can be successfully fetched and that it's of the correct type. There's really no point trying to use a /Root dictionary that e.g. `Catalog.toplevelPagesDict` will reject, and this way we'll be able to fallback to indexing the objects in corrupt documents. - Throw an `InvalidPDFException`, rather than a general `FormatError`, in `XRef.parse` when no usable /Root dictionary could be found. That really seems more appropriate overall, since all attempts at parsing/recovery have failed. (This part of the patch is API-observable, hence the tag.) With these changes, two existing test-cases are improved and the unit-tests are updated/re-factored to highlight that. In particular `GHOSTSCRIPT-698804-1-fuzzed.pdf` will now both load and "render" correctly, whereas `poppler-395-0-fuzzed.pdf` will now fail immediately upon loading (rather than appearing to work).	2021-12-03 11:57:38 +01:00
Jonas Jenwald	1fac6371d3	[Regression] Eagerly fetch/parse the entire /Pages-tree in corrupt documents (issue 14303, PR 14311 follow-up) Please note: This is similar to the method that existed prior to PR 3848, but the new method will only be used as a fallback when parsing of corrupt PDF documents. The implementation in PR 14311 unfortunately turned out to be way too simplistic, as evident by the recently added test-files in issue 14303, since it may cause infinite loops in `PDFDocument.checkLastPage` for some corrupt PDF documents.[1] To avoid this, the easiest solution that I could come up with was to fallback to eagerly parsing the entire /Pages-tree when the /Count-entry validation fails during document initialization. Fixes at least two of the issues listed in issue 14303, namely the `poppler-395-0.pdf...` and `GHOSTSCRIPT-698804-1.pdf...` documents. --- [1] The whole point of PR 14311 was obviously to get rid of infinite loops during document initialization, not to introduce any more of those.	2021-12-02 14:31:04 +01:00
Jonas Jenwald	8ea740c800	Slightly extend the "creates pdf doc from PDF file with bad XRef table" unit-test (PR 14304 follow-up) Given that we're able to "render" this document, let's extend the unit-test to actually check that we're able to obtain the operatorList; although given the overall issues in the document it'll be empty.	2021-12-02 11:51:40 +01:00
Jonas Jenwald	63be23f05b	Handle errors correctly when data lookup fails during /Pages-tree parsing (issue 14303) This only applies to severely corrupt documents, where it's possible that the `Parser` throws when we try to access e.g. a /Kids-entry in the /Pages-tree. Fixes two of the issues listed in issue 14303, namely the `poppler-742-0.pdf...` and `poppler-937-0.pdf...` documents.	2021-12-02 10:54:40 +01:00
Jonas Jenwald	a807ffe907	Prevent circular references in XRef tables from hanging the worker-thread (issue 14303) Please note: While this patch on its own is sufficient to prevent the worker-thread from hanging, in combination with PR 14311 these PDF documents will also both load and render correctly. Rather than focusing on the particular structure of these PDF documents, it seemed (at least to me) to make sense to try and prevent all circular references when fetching/looking-up data using the XRef table. To avoid a solution that required tracking the references manually everywhere, the implementation settled on here instead handles that internally in the `XRef.fetch`-method. This should work, since that method and the `Parser`/`Lexer`-implementations are completely synchronous. Note also that the existing `XRef`-caching, used for all data-types except Streams, should hopefully help to lessen the performance impact of these changes. One potential problem with these changes could be certain browser exceptions, since those are generally not catchable in JavaScript code, however those would most likely "stop" worker-thread parsing anyway (at least I hope so). Finally, note that I settled on returning dummy-data rather than throwing an exception. This was done to allow parsing, for the rest of the document, to continue such that one bad reference doesn't prevent an entire document from loading. Fixes two of the issues listed in issue 14303, namely the `poppler-91414-0.zip-2.gz-53.pdf` and `poppler-91414-0.zip-2.gz-54.pdf` documents.	2021-11-27 23:50:26 +01:00
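The approach described in the commit above (tracking in-flight lookups inside a synchronous fetch, and returning dummy data when a cycle is detected) might look roughly like this. `TinyXRef`, its table layout, and the `CIRCULAR_REF` marker are hypothetical simplifications for illustration, not the real `XRef` class.

```javascript
// Dummy value returned in place of a circular reference (assumed name).
const CIRCULAR_REF = Symbol("CIRCULAR_REF");

class TinyXRef {
  constructor(table) {
    this.table = table; // Map: object number -> value or { ref: number }
    this.pending = new Set(); // object numbers currently being fetched
  }

  fetch(num) {
    if (this.pending.has(num)) {
      // We are already in the middle of fetching `num`, so following this
      // reference again would never terminate; return dummy data instead
      // of throwing, so that the rest of the document can still be parsed.
      return CIRCULAR_REF;
    }
    this.pending.add(num);
    try {
      return this.resolve(this.table.get(num));
    } finally {
      // Fetching is fully synchronous, so the bookkeeping can simply be
      // undone once the lookup has returned.
      this.pending.delete(num);
    }
  }

  resolve(entry) {
    if (entry && typeof entry === "object" && "ref" in entry) {
      return this.fetch(entry.ref);
    }
    return entry;
  }
}
```

Because the tracking lives inside `fetch`, callers do not need to pass a visited set around, which mirrors the motivation given in the commit message.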
Jonas Jenwald	d0c4bbd828	[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) This patch basically extends the approach from PR 10392, by also checking the last page. Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an integer /Count entry it must also be correct/valid. As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to breaking down elsewhere in the worker-thread (and hanging the browser). Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the entire /Pages-tree and essentially counting the pages. To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the entire /Pages-tree to determine the number of pages. Unfortunately these changes will have a number of somewhat negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug. - This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the last page of the PDF documents. - For poorly generated PDF documents, where the entire /Pages-tree only has one level, we'll unfortunately need to fetch/parse the entire /Pages-tree to get to the last page. 
While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of some long PDF documents, - This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost. As one small additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value). Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.	2021-11-27 21:57:35 +01:00
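The short-cut described in the commit above (trust /Count only if the last page it implies can actually be fetched, otherwise count the pages by walking the tree) can be sketched as below. The tree shape (`kids` arrays, a `count` property on the root) and all function names are assumptions for this example; unlike a real implementation, the sketch walks leaf by leaf rather than using intermediate counts to skip subtrees.

```javascript
// Yield the leaf (page) nodes of a hypothetical pages tree in document order.
function* leaves(root) {
  const stack = [root];
  const seen = new Set(); // guards against circular /Kids references
  while (stack.length > 0) {
    const node = stack.pop();
    if (seen.has(node)) {
      continue;
    }
    seen.add(node);
    if (Array.isArray(node.kids)) {
      // Push in reverse so that the first kid is visited first.
      for (let i = node.kids.length - 1; i >= 0; i--) {
        stack.push(node.kids[i]);
      }
    } else {
      yield node;
    }
  }
}

function canFetchPage(root, index) {
  let i = 0;
  for (const _page of leaves(root)) {
    if (i++ === index) {
      return true;
    }
  }
  return false;
}

function countPages(root) {
  let n = 0;
  for (const _page of leaves(root)) {
    n++;
  }
  return n;
}

// Use the claimed count when the last page it implies exists; otherwise
// fall back to counting the pages by traversing the whole tree.
function validatedPageCount(root) {
  const claimed = root.count;
  if (Number.isInteger(claimed) && claimed > 0 && canFetchPage(root, claimed - 1)) {
    return claimed;
  }
  return countPages(root);
}
```

Note that, as in the commit's description, a claimed count that is too small is accepted as-is, since its last page can be fetched successfully.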
Jonas Jenwald	ca8d2bdce4	Abort parsing when the XRef /W-array contains bogus entries (issue 14303) For this particular PDF document, we have `/W [1 2 166666666666666666666666666]` which obviously makes no sense. While this patch makes no attempt at actually validating the entries in the /W-array, we'll now simply abort all processing when the end of the PDF document has been reached (thus preventing hanging the browser). Please note that this patch doesn't enable the PDF document to be loaded/rendered, but at least it fails "correctly" now. Fixes one of the issues listed in issue 14303, namely the `REDHAT-1531897-0.pdf` document.	2021-11-25 18:35:08 +01:00
Jonas Jenwald	ae4f1ae3e7	Ensure that `ChunkedStream` won't attempt to request data beyond the document size (issue 14303) This bug was surprisingly difficult to track down, since it didn't just depend on range-requests being used but also on how quickly the document was loaded. To even be able to reproduce this locally, I had to use a very small `rangeChunkSize`-value (note the unit-test). The cause of this bug is a bogus entry in the XRef-table, causing us to attempt to request data from beyond the actual document size and thus getting into an infinite loop. Fixes one of the issues listed in issue 14303, namely the `PDFBOX-4352-0.pdf` document.	2021-11-24 19:19:43 +01:00
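The fix described above amounts to never asking for bytes past the end of the document. A minimal sketch of that clamping, assuming a hypothetical helper `chunksForRange` (not a PDF.js function) that maps a byte range to chunk indices:

```javascript
// Return the chunk indices needed to cover [begin, end), after clamping the
// range to the known document length. A bogus XRef entry that points far
// beyond the end of the file therefore yields no out-of-range requests.
function chunksForRange(begin, end, chunkSize, length) {
  const clampedBegin = Math.max(0, Math.min(begin, length));
  const clampedEnd = Math.max(clampedBegin, Math.min(end, length));
  const chunks = [];
  if (clampedBegin === clampedEnd) {
    return chunks; // nothing inside the document to request
  }
  const first = Math.floor(clampedBegin / chunkSize);
  const last = Math.ceil(clampedEnd / chunkSize); // exclusive
  for (let chunk = first; chunk < last; chunk++) {
    chunks.push(chunk);
  }
  return chunks;
}
```

With a 10-byte document and a chunk size of 4, a request for bytes 8 to 1000 is reduced to the single chunk that holds the final two bytes.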
Jonas Jenwald	0ebac67a9f	Remove the `{BaseViewer, PDFThumbnailViewer}._pagesRequests` caches In the `BaseViewer` this cache is mostly relevant in the `disableAutoFetch = true` mode, since the pages are being initialized lazily in that case. In the `PDFThumbnailViewer` this cache is mostly used for thumbnails that are actually being rendered, as opposed to those created directly from the "regular" pages. Please note that I'm not suggesting that we remove these caches because they're only used in some situations, but rather because they're for all intents and purposes actually redundant. In the API itself, we're already caching both the page-promises and the actual pages themselves on the `WorkerTransport`-instance. Hence these viewer-caches aren't really necessary in practice, and adds what to me mostly seems like an unnecessary level of indirection.[1] Given that the viewer now relies on caching in the API itself, this patch also adds a new unit-test to ensure that page-caching works (and keep working) as expected. --- [1] In the `WorkerTransport.getPage`-method the parameter is being validated on every call, but that's hardly enough code to warrant keeping the "duplicate" caches in the viewer in my opinion.	2021-11-21 11:40:45 +01:00
Jonas Jenwald	6da0944fc7	[api-minor] Replace `PDFDocumentProxy.getStats` with a synchronous `PDFDocumentProxy.stats` getter Please note: These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents. The existing `PDFDocumentProxy.getStats` function, which in the default viewer is called for each rendered page, requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for every rendered page. This patch proposes replacing that method with a synchronous `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class to track Stream/Font-types and only send them to the main-thread the first time that a type is encountered. Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1] This re-factoring will obviously benefit longer document the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes. Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return the same identical object. This is something that we can easily take advantage of in the default viewer, by now only reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents). 
--- [1] Furthermore, the maximum number of `StreamType`/`FontType` entries is `10` and `12` respectively, which means that regardless of the complexity and page count in a PDF document there'll never be more than twenty-two "DocStats" messages sent; see `41ac3f0c07/src/shared/util.js (L206-L232)` [2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only results in a total of seven "DocStats" messages being sent from the worker-thread. [3] Reporting telemetry, in Firefox, includes using `JSON.stringify` on the data and then sending an event to the `PdfStreamConverter.jsm`-code. In that code the event is handled and `JSON.parse` is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549	2021-11-20 12:20:55 +01:00
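The "only send a type the first time it is encountered" behaviour described in the commit above could be sketched as follows. The class name `DocStatsSketch`, its method names, and the message shape are illustrative assumptions rather than the actual worker-thread code.

```javascript
// Tracks which stream/font types have been seen, and notifies a listener
// only when a previously unseen type is added.
class DocStatsSketch {
  constructor(send) {
    this._send = send; // stand-in for posting a message to the main thread
    this._streamTypes = new Set();
    this._fontTypes = new Set();
  }

  addStreamType(type) {
    this._add(this._streamTypes, type);
  }

  addFontType(type) {
    this._add(this._fontTypes, type);
  }

  _add(set, type) {
    if (set.has(type)) {
      return; // already reported, no message needed
    }
    set.add(type);
    this._send({
      streamTypes: [...this._streamTypes],
      fontTypes: [...this._fontTypes],
    });
  }
}
```

Since the number of distinct types is small and bounded, the number of messages is bounded as well, no matter how many pages are parsed.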
Jonas Jenwald	971ac8e993	Include the /Lang-property, when it exists, in the StructTree-data (issue 14261) Please note: This is a tentative patch, since I don't have the necessary a11y-software to actually test it.	2021-11-14 12:37:41 +01:00
calixteman	85c6dd59ce	Merge pull request #14268 from calixteman/outline Remove non-displayable chars from outline title (#14267)	2021-11-13 08:12:56 -08:00
Calixte Denizet	7041c62ccf	Remove non-displayable chars from outline title (#14267) - it aims to fix #14267; - there is nothing about chars in the range [0-1F] in the specs, but Acrobat doesn't display them in any way.	2021-11-13 16:56:08 +01:00
