pdf.js

Author	SHA1	Message	Date
Calixte Denizet	74f25d2755	Font renderer - get int8 instead of uint8 in composite glyphs (bug 1749563) - it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1749563; - use some helper functions to get (u|i)int** values from the buffer, which makes the code clearer; - in composite glyphs the translation values that accompany a transformation are signed, so read them as int8 instead of uint8; - add a few TODOs.	2022-01-18 22:06:23 +01:00
Jonas Jenwald	a13ae5d97d	Support Type1 font files with incomplete /CharStrings definitions (issue 14462) Please refer to https://www.pdfa.org/norm-refs/Type1Fonts.pdf#page=15 for the expected format for the /CharStrings entries. In the referenced PDF document the /CharStrings are missing the expected end-token, which causes us to swallow the start of the next glyph name.	2022-01-17 18:55:22 +01:00
Tim van der Meij	e08fd5e389	Implement a unit test for `getCharUnicodeCategory` in `src/core/unicode.js` (PR 14428 follow-up) Given that the other functions in this file are already covered by unit tests, we should also cover this newly added function.	2022-01-16 15:18:05 +01:00
Tim van der Meij	922dac035c	Merge pull request #14448 from Snuffleupagus/Type3-circular-refs Prevent circular references in Type3 fonts	2022-01-15 14:11:47 +01:00
Tim van der Meij	a72d188599	Merge pull request #14439 from Snuffleupagus/issue-14438 Ignore Annotations with empty /Rect-entries in the display-layer (issue 14438)	2022-01-15 14:11:25 +01:00
Tim van der Meij	625f829842	Merge pull request #14446 from Snuffleupagus/issue-14435 Expose even more API-functionality in the TypeScript definitions (issue 14435, PR 14013 follow-up)	2022-01-15 13:46:11 +01:00
Jonas Jenwald	76444888fb	Add (basic) UTF-8 support in the `stringToPDFString` helper function (issue 14449) This patch implements this by looking for the UTF-8 BOM, i.e. `\xEF\xBB\xBF`, in order to determine the encoding.[1] The actual conversion is done using the `TextDecoder` interface, which should be available in all environments/browsers that we support; please see https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder#browser_compatibility --- [1] Assuming that everything lacking a UTF-16 BOM would have to be UTF-8 encoded really doesn't seem correct.	2022-01-14 18:57:07 +01:00
Jonas Jenwald	4c55563574	Add an additional test-case for circular references in Type3 fonts The PDF document in this patch already worked without the previous patch, but I wanted to improve our test-coverage for the Type3-parsing. The attached PDF document was also found in https://github.com/pdf-association/safedocs/tree/main/Miscellaneous%20Targeted%20Test%20PDFs	2022-01-13 17:59:57 +01:00
Jonas Jenwald	53d4ee7990	Prevent circular references in Type3 fonts In corrupt PDF documents Type3 fonts may introduce circular dependencies, thus resulting in the affected font(s) never loading and parsing/rendering never completing. Note that I've not seen any real-world examples of this kind of font corruption; the attached PDF document was instead found in https://github.com/pdf-association/safedocs/tree/main/Miscellaneous%20Targeted%20Test%20PDFs Please note: That repository contains a number of reduced test-cases that are specifically intended to test interoperability (between PDF viewers) and parsing/rendering for various kinds of strange/corrupt PDF documents. Some of the test-cases found there may thus not make sense to try and "fix" upfront, in my opinion, unless the problems are also found in real-world PDF documents.	2022-01-13 17:58:37 +01:00
Jonas Jenwald	b9849e38b8	Expose even more API-functionality in the TypeScript definitions (issue 14435, PR 14013 follow-up) While `PageViewport` apparently makes sense in TypeScript environments, given that it's being returned by the `PDFPageProxy.getViewport`-method in the API, we really don't want to extend the public API by simply exporting the class directly in `src/pdf.js` since it should never be called/initialized manually. Hence we follow the same pattern as in PR 14013, and also extend the API unit-tests to ensure that `PDFPageProxy.getViewport` always returns a `PageViewport`-instance as expected.	2022-01-13 12:05:40 +01:00
Jonas Jenwald	08d88a0235	Ignore Annotations with empty /Rect-entries in the display-layer (issue 14438) This prevents the `BaseSVGFactory.create`-method from throwing, and thus preventing any remaining Annotations (on the page) from rendering in corrupt documents.	2022-01-11 13:54:35 +01:00
Jonas Jenwald	457ff0d54a	Update Jasmine to version 4 For the unit-tests that were updated in this patch, note that I settled on simply using `toEqual` comparisons rather than updating the custom matchers (since those don't seem necessary any more). Please refer to the following resources for additional information: - https://github.com/jasmine/jasmine/blob/main/release_notes/4.0.0.md - https://github.com/jasmine/jasmine-npm/blob/main/release_notes/4.0.0.md - https://jasmine.github.io/tutorials/upgrading_to_Jasmine_4.0	2022-01-09 11:32:34 +01:00
Tim van der Meij	8ac0ccc227	Merge pull request #14424 from Snuffleupagus/mv-addLinkAttributes [api-minor] Move `addLinkAttributes`, `LinkTarget`, and `removeNullCharacters` into the viewer (PR 14092 follow-up)	2022-01-08 13:19:11 +01:00
Calixte Denizet	6369617e6f	[JS] Fix a few errors around AFSpecial_Keystroke - @cincodenada found some errors which are fixed in this patch; - it partially fixes issue #14306; - add some tests.	2022-01-08 12:34:56 +01:00
Jonas Jenwald	7b8794b37e	[api-minor] Move `removeNullCharacters` into the viewer This helper function has never been used in e.g. the worker-thread, hence its placement in `src/shared/util.js` led to a small amount of unnecessary duplication. After the previous patches this helper function is now only used in the viewer, hence it no longer seems necessary to expose it through the official API. Please note: It seems somewhat unlikely that third-party users were relying directly on this helper function, which is why it's not being exported as part of the viewer components. (If necessary, we can always change this later on.)	2022-01-06 12:25:33 +01:00
Jonas Jenwald	290cbc5232	Merge pull request #14418 from calixteman/14415 Use positive dimensions for text chunks in the text layer (issue #14415)	2022-01-05 12:00:36 +01:00
Calixte Denizet	6cdae5ac4d	Use positive dimensions for text chunks in the text layer (issue #14415 ).	2022-01-05 10:49:56 +01:00
Jonas Jenwald	2722deb610	Revert "Disable failing print actions integration test in Firefox"	2022-01-04 14:19:27 +01:00
Jonas Jenwald	b99927e1ee	Improve the API unit-tests for scripting-related functionality I happened to notice that we didn't have any unit-tests for either `getFieldObjects` or `getCalculationOrderIds`, on the `PDFDocumentProxy` class, which seems unfortunate since it's API functionality that we depend on in e.g. the viewer.	2021-12-29 12:57:32 +01:00
Tim van der Meij	e42d54e1b5	Merge pull request #14400 from Snuffleupagus/getPageDict-async [api-minor] Convert `Catalog.getPageDict` to an asynchronous method	2021-12-28 19:40:34 +01:00
Jonas Jenwald	b513c64d9d	[api-minor] Convert `Catalog.getPageDict` to an asynchronous method Besides converting `Catalog.getPageDict` to an `async` method, thus simplifying the code, this patch also allows us to pro-actively fix an existing issue. Note how we're looking up References in such a way that `MissingDataException`s won't cause trouble, however it's technically possible that the entries (i.e. /Count, /Kids, and /Type) in a /Pages Dictionary could actually be indirect objects as well. In the existing code this could lead to some, or even all, pages failing to load/render as intended. In practice that doesn't appear to happen in real-world PDF documents, but given all the weird things that PDF software does I'd prefer to fix this pro-actively (rather than waiting for a bug report). With `Catalog.getPageDict` being `async` this is now really simple to address, however I didn't want to introduce a bunch more unconditional asynchronicity in this method if it could be avoided (since that could slow things down). Hence we'll synchronously lookup the raw data in a /Pages Dictionary, and only fallback to asynchronous data lookup when a Reference was encountered. In addition to the above, this patch also makes the following notable changes: - Let `Catalog.getPageDict` consistently reject with the actual error, regardless of what data we're fetching. Previously we'd "swallow" the actual errors except when looking up Dictionary entries, which is inconsistent and thus seems unfortunate. As can be seen from the updated unit-tests this change is API-observable, hence why the patch is tagged `[api-minor]`. - Improve the consistency of the Dictionary /Type-checks in both the `Catalog.getPageDict` and `Catalog.getAllPageDicts` methods. In `Catalog.getPageDict` there's a fallback code-path where we're incorrectly checking the /Page Dictionary for a /Contents-entry, which is wrong since a /Page Dictionary doesn't need to have a /Contents-entry in order to be valid. For consistency the `Catalog.getAllPageDicts` method is also updated to handle errors in the /Type-lookup correctly. - Reduce the `PagesCountLimit.PAUSE_EAGER_PAGE_INIT` viewer constant, to further improve loading/rendering performance of the second page during initialization of very long documents; PR 14359 follow-up.	2021-12-25 15:22:48 +01:00
KouWakai	98158b67a3	Handle non-integer Annotation border widths correctly (issue 14203) The existing code appears to be wrong, since according to the PDF specification the border width of an Annotation only has to be a number and not specifically an integer. Please see: - https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=392 - https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G11.2096210 - https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G6.1965562	2021-12-24 22:10:19 +09:00
Tim van der Meij	71326c6a1c	Enable the `no-var` linting rule in `test/testutils.js` This is done automatically with the `gulp lint --fix` command with the only exception of the `parts` variable.	2021-12-18 15:58:47 +01:00
Tim van der Meij	a24982a733	Drop custom confirmation logic in favor of using the built-in Node.js `readline` module Most likely this code predates our use of Node.js, but in Node.js asking for user confirmation is a solved problem, so we can remove the custom logic we have for this, which overall makes things much simpler.	2021-12-18 15:52:04 +01:00
Jonas Jenwald	e0dba504d2	Fix broken/missing JSDocs and `typedef`s, to allow updating TypeScript to the latest version (issue 14342) This patch circumvents the issues seen when trying to update TypeScript to version `4.5`, by "simply" fixing the broken/missing JSDocs and `typedef`s such that `gulp typestest` now passes. As always, given that I don't really know anything about TypeScript, I cannot tell if this is a "correct" and/or proper way of doing things; we'll need TypeScript users to help out with testing! Please note: I'm sorry about the size of this patch, but given how intertwined all of this unfortunately is it just didn't seem easy to split this into smaller parts. However, one good thing about this TypeScript update is that it helped uncover a number of pre-existing bugs in our JSDocs comments.	2021-12-15 23:14:25 +01:00
Jonas Jenwald	0a19ef6864	Move the `EventBus`, and related functionality, into its own file The size of the `web/ui_utils.js` file has increased over time, as more code has been added to (or moved into) that file. To reduce its size slightly, this patch moves the event-related functionality into a separate file.	2021-12-15 17:18:57 +01:00
Tim van der Meij	1bc6b846b6	Consistently use string arguments for `page.waitForFunction` calls We use string arguments in all other places, so these two places are a bit inconsistent in that sense. Moreover, it's just one argument now, which makes it a bit easier to read and see what it does because we don't have to pass the always-empty options argument anymore. Finally, doing it like this ensures it works in all Puppeteer versions given https://github.com/puppeteer/puppeteer/issues/7836.	2021-12-12 19:45:34 +01:00
Tim van der Meij	2643e6a823	Disable failing print actions integration test in Firefox Once the upstream bug is fixed it can be enabled again because it's causing way too much noise now. This is tracked in issue #14293. Note that I deliberately added a new block so we can easily remove it later on and because the other block is about another bug.	2021-12-12 16:10:50 +01:00
Jonas Jenwald	e8562173b8	Prevent an infinite loop when parsing corrupt /CCITTFaxDecode data (issue 14305) Fixes one of the documents in issue 14305.	2021-12-07 13:57:25 +01:00
Jonas Jenwald	909f012fb8	Add a (linked) test-case for issue 8022 Given that [bug 1336591](https://bugzilla.mozilla.org/show_bug.cgi?id=1336591) was just closed as fixed, thus fixing issue 8022 in Firefox, let's add a test-case to enable us to catch any future regressions either in PDF.js or in browsers themselves.	2021-12-06 15:27:40 +01:00
Tim van der Meij	911a9d34b1	Fix code duplication in the rasterization logic in `test/driver.js` Now that the rasterization logic is encapsulated in a class, we can easily move the container creation into a separate static method.	2021-12-05 19:29:39 +01:00
Tim van der Meij	03506f25c0	Move the rasterization logic into one single class This refactoring ensures that we can get rid of the closures and encapsulate the logic in a nicer way with e.g., getters for the style promises.	2021-12-05 19:28:51 +01:00
Tim van der Meij	33dc0628a0	Enable the `no-var` linting rule in `test/driver.js` This is done automatically with the `gulp lint --fix` command with the only exception of the `annotationLayerContext` variable.	2021-12-05 15:41:36 +01:00
Tim van der Meij	5fd4276dcf	Use async/await in the rasterization classes in `test/driver.js` This is achieved by letting the `writeSVG` function return a promise so we don't need callback passing anymore.	2021-12-05 14:11:09 +01:00
Tim van der Meij	13786ef806	Use arrow functions instead of `self` variables in `test/driver.js`	2021-12-05 14:11:08 +01:00
Tim van der Meij	1d1f713bfc	Inline `loadStyles` calls in the rasterization classes in `test/driver.js` The wrapper functions in this case only really added indirection, so this commit simplifies the code a bit.	2021-12-05 13:49:04 +01:00
Tim van der Meij	a58700b0dc	Convert the `Driver` class to ES6 syntax in `test/driver.js`	2021-12-05 13:43:02 +01:00
Tim van der Meij	dc455c836e	Merge pull request #14339 from Snuffleupagus/issue-8019-reftest Add a (linked) test-case for issue 8019	2021-12-04 13:26:47 +01:00
Tim van der Meij	335c4c8a43	Merge pull request #14338 from Snuffleupagus/XRef-more-Pages-validation [api-minor] Clear all caches in `XRef.indexObjects`, and improve /Root dictionary validation in `XRef.parse` (issue 14303)	2021-12-04 13:23:40 +01:00
Jonas Jenwald	40291d1943	Handle errors when fetching the raw /Metadata (issue 14305) Currently the `Catalog.metadata` getter only handles errors during parsing, however in a corrupt PDF document fetching of the raw /Metadata can obviously fail as well. Without this patch the `PDFDocumentProxy.getMetadata` method, in the API, can thus fail which it never should and this will cause the viewer to not initialize all state as expected. Fixes one of the documents in issue 14305.	2021-12-04 09:41:42 +01:00
Jonas Jenwald	ca82e1832f	Add a (linked) test-case for issue 8019 Given that [bug 1336572](https://bugzilla.mozilla.org/show_bug.cgi?id=1336572) was just closed as fixed, thus fixing issue 8019 in Firefox[1], let's add a test-case to enable us to catch any future regressions either in PDF.js or in browsers themselves. --- [1] It also seems to be working in Google Chrome, although I'm having a slightly difficult time deciphering exactly what configurations were affected when looking through issue 8019.	2021-12-04 08:56:04 +01:00
Jonas Jenwald	ad3a271fc4	[api-minor] Clear all caches in `XRef.indexObjects`, and improve /Root dictionary validation in `XRef.parse` (issue 14303) This patch improves handling of a couple of PDF documents from issue 14303. - Update `XRef.indexObjects` to actually clear all XRef-caches. Invalid XRef tables usually cause issues early enough during parsing that we've not populated the XRef-cache, however to prevent any issues we obviously need to clear that one as well. - Improve the /Root dictionary validation in `XRef.parse` (PR 9827 follow-up). In addition to checking that a /Pages entry exists, we'll now also check that it can be successfully fetched and that it's of the correct type. There's really no point trying to use a /Root dictionary that e.g. `Catalog.toplevelPagesDict` will reject, and this way we'll be able to fallback to indexing the objects in corrupt documents. - Throw an `InvalidPDFException`, rather than a general `FormatError`, in `XRef.parse` when no usable /Root dictionary could be found. That really seems more appropriate overall, since all attempts at parsing/recovery have failed. (This part of the patch is API-observable, hence the tag.) With these changes, two existing test-cases are improved and the unit-tests are updated/re-factored to highlight that. In particular `GHOSTSCRIPT-698804-1-fuzzed.pdf` will now both load and "render" correctly, whereas `poppler-395-0-fuzzed.pdf` will now fail immediately upon loading (rather than appearing to work).	2021-12-03 11:57:38 +01:00
Jonas Jenwald	1fac6371d3	[Regression] Eagerly fetch/parse the entire /Pages-tree in corrupt documents (issue 14303, PR 14311 follow-up) Please note: This is similar to the method that existed prior to PR 3848, but the new method will only be used as a fallback when parsing of corrupt PDF documents. The implementation in PR 14311 unfortunately turned out to be way too simplistic, as evidenced by the recently added test-files in issue 14303, since it may cause infinite loops in `PDFDocument.checkLastPage` for some corrupt PDF documents.[1] To avoid this, the easiest solution that I could come up with was to fallback to eagerly parsing the entire /Pages-tree when the /Count-entry validation fails during document initialization. Fixes at least two of the issues listed in issue 14303, namely the `poppler-395-0.pdf...` and `GHOSTSCRIPT-698804-1.pdf...` documents. --- [1] The whole point of PR 14311 was obviously to get rid of infinite loops during document initialization, not to introduce any more of those.	2021-12-02 14:31:04 +01:00
Jonas Jenwald	8ea740c800	Slightly extend the "creates pdf doc from PDF file with bad XRef table" unit-test (PR 14304 follow-up) Given that we're able to "render" this document, let's extend the unit-test to actually check that we're able to obtain the operatorList; although given the overall issues in the document it'll be empty.	2021-12-02 11:51:40 +01:00
Jonas Jenwald	63be23f05b	Handle errors correctly when data lookup fails during /Pages-tree parsing (issue 14303) This only applies to severely corrupt documents, where it's possible that the `Parser` throws when we try to access e.g. a /Kids-entry in the /Pages-tree. Fixes two of the issues listed in issue 14303, namely the `poppler-742-0.pdf...` and `poppler-937-0.pdf...` documents.	2021-12-02 10:54:40 +01:00
Tim van der Meij	0d2cdff6c5	Fix browser page navigation for Puppeteer 11+ in `test/test.js` In Puppeteer 11 we noticed that Firefox doesn't shut down once the tests are done anymore. I tracked this down to the `page.goto` call, in `startBrowser`, never resolving anymore. I can only assume that something changed in Puppeteer, possibly in combination with recent Firefox Nightly versions, that caused this, but haven't been able to fully track it down. However, I did find that the problem is that the `load` event no longer triggers, so fortunately we can fix the problem by explicitly waiting for the `domcontentloaded` event instead. In general this change might even be better since we now wait until the test framework is fully loaded before we continue. Note that this also still works for the current Puppeteer version. I did find two upstream references that appear to track this issue, both on the Puppeteer side and on the Firefox side, making me further suspect that the issue is partly on both sides: - https://github.com/puppeteer/puppeteer/issues/5806 - https://bugzilla.mozilla.org/show_bug.cgi?id=1706353	2021-11-28 18:58:22 +01:00
Tim van der Meij	60ed3cd297	Fix compatibility with Node.js 17 in `test/test.js` Node.js 17, which as of writing is the most recent version, contains a breaking change in its DNS resolver, causing Firefox not to start anymore in our test framework. The inline comment together with the following resources provide more background: - https://github.com/nodejs/node/issues/40702 - https://github.com/nodejs/node/pull/39987 - https://github.com/cyrus-and/chrome-remote-interface/issues/467 - https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V17.md#other-notable-changes - https://github.com/DeviceFarmer/adbkit/issues/209 - https://nodejs.org/api/dns.html#dnssetdefaultresultorderorder This commit ensures that versions both older and newer than Node.js 17 work as expected. This is mainly necessary since the bots as of writing run Node.js 14.17.0 which is from before this API got introduced and for example Node.js 12 LTS is only end-of-life in April 2022, so we have to keep support for those older versions unfortunately.	2021-11-28 18:52:51 +01:00
Tim van der Meij	5309133a9d	Fix browser error logging in `test/test.js` If a browser cannot be started, we currently get the following log: `Error while starting firefox: [object Object]`. This is simply an oversight from the initial Puppeteer integration work since we never got into this code path before. With this fix the error log becomes more useful: `Error while starting firefox: connect ECONNREFUSED ::1:45387`	2021-11-28 18:08:08 +01:00
Jonas Jenwald	a807ffe907	Prevent circular references in XRef tables from hanging the worker-thread (issue 14303) Please note: This patch on its own is sufficient to prevent the worker-thread from hanging; in combination with PR 14311, however, these PDF documents will both load and render correctly. Rather than focusing on the particular structure of these PDF documents, it seemed (at least to me) to make sense to try and prevent all circular references when fetching/looking-up data using the XRef table. To avoid a solution that required tracking the references manually everywhere, the implementation settled on here instead handles that internally in the `XRef.fetch`-method. This should work, since that method and the `Parser`/`Lexer`-implementations are completely synchronous. Note also that the existing `XRef`-caching, used for all data-types except Streams, should hopefully help to lessen the performance impact of these changes. One potential problem with these changes could be certain browser exceptions, since those are generally not catchable in JavaScript code, however those would most likely "stop" worker-thread parsing anyway (at least I hope so). Finally, note that I settled on returning dummy-data rather than throwing an exception. This was done to allow parsing, for the rest of the document, to continue such that one bad reference doesn't prevent an entire document from loading. Fixes two of the issues listed in issue 14303, namely the `poppler-91414-0.zip-2.gz-53.pdf` and `poppler-91414-0.zip-2.gz-54.pdf` documents.	2021-11-27 23:50:26 +01:00
Jonas Jenwald	d0c4bbd828	[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) This patch basically extends the approach from PR 10392, by also checking the last page. Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an integer /Count entry it must also be correct/valid. As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to break down elsewhere in the worker-thread (and hang the browser). Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the entire /Pages-tree and essentially count the pages. To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the entire /Pages-tree to determine the number of pages. Unfortunately these changes will have a number of somewhat negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug. - This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the last page of the PDF documents. - For poorly generated PDF documents, where the entire /Pages-tree only has one level, we'll unfortunately need to fetch/parse the entire /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of some long PDF documents. - This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost. As one small additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value). Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.	2021-11-27 21:57:35 +01:00
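
The /Count validation described in commit d0c4bbd828 can be illustrated with a small sketch. This is not the PDF.js implementation: the tree below is made of plain objects rather than real dictionaries, and the names `getPage`, `countPages`, and `validatedPageCount` are hypothetical. It only shows the idea of trusting /Count when the last page can be fetched, and counting the pages otherwise.

```javascript
// Hypothetical, simplified /Pages tree: plain objects, not PDF.js Dicts.
// Find the page at `index`, skipping subtrees whose claimed /Count ends
// before it. Returns null when no such page exists.
function getPage(root, index) {
  const visited = new Set(); // Guards against circular /Kids entries.
  const stack = [root];
  let current = 0; // Pages passed or skipped so far.
  while (stack.length > 0) {
    const node = stack.pop();
    if (!node || visited.has(node)) continue;
    visited.add(node);
    if (node.Type === "Page") {
      if (current === index) return node;
      current++;
      continue;
    }
    if (!Array.isArray(node.Kids)) continue;
    const count = node.Count;
    if (node !== root && Number.isInteger(count) && count >= 0 &&
        current + count <= index) {
      current += count; // The wanted page is not inside this subtree.
      continue;
    }
    for (let i = node.Kids.length - 1; i >= 0; i--) stack.push(node.Kids[i]);
  }
  return null;
}

// Slow path: walk the whole tree and count the /Page nodes.
function countPages(root) {
  const visited = new Set();
  const stack = [root];
  let total = 0;
  while (stack.length > 0) {
    const node = stack.pop();
    if (!node || visited.has(node)) continue;
    visited.add(node);
    if (node.Type === "Page") total++;
    else if (Array.isArray(node.Kids)) stack.push(...node.Kids);
  }
  return total;
}

// Short-cut: trust /Count only if the last page it implies can be fetched.
function validatedPageCount(root) {
  const count = root.Count;
  if (Number.isInteger(count) && count > 0 && getPage(root, count - 1)) {
    return count;
  }
  return countPages(root);
}
```

As in the commit message, the short-cut accepts a /Count that is too small, since the page it points at still exists; only a /Count that is too large or not an integer triggers the full walk.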

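Commit a807ffe907 describes guarding `XRef.fetch` against circular references by tracking them inside the method itself and returning dummy data instead of throwing. A minimal sketch of that idea follows, assuming a toy `TinyXRef` class, string keys in place of real references, and a `CIRCULAR` sentinel, all of which are hypothetical and not taken from PDF.js. It relies on parsing being fully synchronous, as the commit message notes.

```javascript
// Hypothetical sentinel returned in place of a circular reference.
const CIRCULAR = Symbol("circular-ref");

class TinyXRef {
  // `entries` maps a reference key to a synchronous "parse" function that
  // may itself call `fetch` for other references.
  constructor(entries) {
    this.entries = entries;
    this.pending = new Set(); // References currently being fetched.
  }

  fetch(ref) {
    if (this.pending.has(ref)) {
      // We are already in the middle of fetching `ref`: return dummy data
      // so that the rest of the document can still be parsed.
      return CIRCULAR;
    }
    const parse = this.entries.get(ref);
    if (!parse) {
      return null; // Unknown reference.
    }
    this.pending.add(ref);
    try {
      return parse(this);
    } finally {
      this.pending.delete(ref); // Always clear, even if parsing throws.
    }
  }
}

// "1R" depends on "2R", which points back at "1R".
const xref = new TinyXRef(
  new Map([
    ["1R", x => ({ length: x.fetch("2R") })],
    ["2R", x => x.fetch("1R")],
    ["3R", () => 42],
  ])
);
```

Fetching `"1R"` here terminates with the sentinel in place of the looping value, while unrelated entries such as `"3R"` are unaffected.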

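The UTF-8 handling from commit 76444888fb (look for the BOM `\xEF\xBB\xBF`, then convert with `TextDecoder`) can be sketched as below. The function name `decodeUtf8WithBom` is hypothetical, and the input is assumed to be a "binary string" with one character per byte; this is an illustration of the approach, not the actual `stringToPDFString` code.

```javascript
// Decode a binary string (one char per byte) as UTF-8 if, and only if, it
// starts with the UTF-8 BOM. Returns null when no BOM is present so that a
// caller can fall back to other encodings.
function decodeUtf8WithBom(str) {
  if (!str.startsWith("\xEF\xBB\xBF")) {
    return null;
  }
  // Turn each character back into the byte it represents.
  const bytes = Uint8Array.from(str, ch => ch.charCodeAt(0) & 0xff);
  // With default options, TextDecoder removes the BOM from its output.
  return new TextDecoder("utf-8").decode(bytes);
}
```

For example, the bytes `EF BB BF C3 A9` decode to the single character `é`, while a string without the BOM yields `null`.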
2887 Commits