pdf.js

Author	SHA1	Message	Date
Calixte Denizet	18f4e560ae	[Search] Some matches were incorrectly shifted because of some '-\n' - it aims to fix #14562; - 'X-\n' were not correctly positioned; - when X is a diacritic (e.g. in "sä-\n", which is decomposed into "sa¨-\n") we must handle both things: - diacritics on the one hand; - "-\n" on the other hand.	2022-02-14 10:12:33 +01:00
Calixte Denizet	18e3a98c2b	[api-minor] Don't add in the text content the chars which are out-of-page (bug 1755201) - it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1755201; - if the glyph position is not within the view then skip it.	2022-02-13 21:07:11 +01:00
Brendan Dahl	f8b2a99ddc	Merge pull request #14543 from Snuffleupagus/bug-1753983 Let `Lexer.getNumber` treat a single minus sign as zero (bug 1753983)	2022-02-09 14:06:35 -08:00
Jonas Jenwald	60efae96fd	Update the file used with the `xfa_bug1720182` test-case The file used in this test-case is identical to, i.e. the md5 entry perfectly matches, the file used with the `xfa_bug1716380` test-case. While it's obviously fine to use the same PDF document in different reference-tests, note how we e.g. have both `eq` and `text` tests for one document, we should always avoid adding duplicate files in the `test/pdfs/` folder.	2022-02-08 14:23:41 +01:00
Jonas Jenwald	64f3dbeb48	Let `Lexer.getNumber` treat a single minus sign as zero (bug 1753983) This appears to be consistent with the behaviour in both Adobe Reader and PDFium (in Google Chrome); this is essentially the same approach as used for a single decimal point in PR 9827.	2022-02-07 17:09:47 +01:00
Calixte Denizet	1f41028fcb	Support search with or without diacritics (bug 1508345, bug 916883, bug 1651113) - get original index in using a dichotomic seach instead of a linear one; - normalize the text in using NFD; - convert the query string into a RegExp; - replace whitespaces in the query with \s+; - handle hyphens at eol use to break a word; - add some \s* around punctuation signs	2022-02-03 15:42:55 +01:00
Jonas Jenwald	e13353cf21	Remove the `xfa_bug1716838` browser-test since it's a duplicate The md5 entry perfectly matches the `xfa_bug1717668_1` test-case, which means that we unnecessarily test the same exact document twice.	2022-02-01 12:05:19 +01:00
Jonas Jenwald	ef1676678f	Remove the `MBTA-pretax-form-July2012` browser-test since it's a duplicate The md5 entry perfectly matches the `xfa_bug1718521_2` test-case, which means that we unnecessarily test the same exact document twice.	2022-01-31 12:35:26 +01:00
Calixte Denizet	ae842e1c3a	[api-minor] Annotations - Adjust the font size in text field in considering the total width (bug 1721335) - it aims to fix #14502 and bug 1721335; - Acrobat and Pdfium do the same; - it'll avoid to have truncated data when printed; - change the factor to compute font size in using field height: lineHeight = 1.35*fontSize - this is the value used by Acrobat. - in order to not have truncated strings on the bottom, add few basic metrics for standard fonts.	2022-01-30 15:53:31 +01:00
calixteman	838909f8c1	Merge pull request #14491 from quaoaris/lines-rendered-too-thick fix for lines (stroke) are rendered too thick (Bug 1743245)	2022-01-27 18:46:26 +01:00
Calixte Denizet	3a7004ca25	Take into account all rotations before comparing glyph positions - it aims to fix #14497; - previously, only rotations with an angle 0, 90, 180 or 270 were taken into account; - so generalize to any angle but keep the fast path for 0, 90, ... because they're likely more common than anything else.	2022-01-26 17:19:00 +01:00
quaoaris	3f77d80f31	fix for lines (stroke) are rendered too thick (Bug 1743245) This commit fixes Bug 1743245 (Grided PDF file lines rendered too thick) which was created by a fix for #12868 . The lineWidth was set to round(1 * this._combinedScaleFactor) when the pixel is drawn as a parallelorgam with a height <1. This fix changes this to floor(1*this._combinedScaleFactor) . This change shows a visual result comparable to Chrome and Acrobat. Regarding the last PR 3 statements in canvas.js are affected and will change with this commit (stroke and paintChar). renaming the reference files to naming comvention	2022-01-25 10:27:30 +01:00
Calixte Denizet	e1d3a3b414	Remove the invisible format marks from the text chunks - it aims to fix issue #9186.	2022-01-24 13:47:24 +01:00
calixteman	88236e1163	Merge pull request #14430 from calixteman/beforeinput [JS] Use beforeinput event to trigger a keystroke event in the sandbox	2022-01-23 20:42:33 +01:00
Calixte Denizet	6ac296e48e	[JS] Use beforeinput event to trigger a keystroke event in the sandbox - it aims to fix issue #14307; - this event has been added recently in Firefox and we can now use it; - fix few bugs in aform.js or in annotation_layer.js; - add some integration tests to test keystroke events (see `AFSpecial_Keystroke`); - make dispatchEvent in the quickjs sandbox async.	2022-01-23 19:53:01 +01:00
Tim van der Meij	23b6fde9fc	Merge pull request #14464 from Snuffleupagus/issue-14462 Support Type1 font files with incomplete /CharStrings definitions (issue 14462)	2022-01-19 20:38:46 +01:00
Calixte Denizet	74f25d2755	Font renderer - get int8 instead of uint8 in composite glyphes (bug 1749563) - it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1749563; - use some helper functions to get (u\|i)int** values in buffer: it helps to have a clearer code; - in composite glyphes the translations values with a transformations are signed so consequently get some int8 instead of uint8; - add few TODOs.	2022-01-18 22:06:23 +01:00
Jonas Jenwald	a13ae5d97d	Support Type1 font files with incomplete /CharStrings definitions (issue 14462) Please refer to https://www.pdfa.org/norm-refs/Type1Fonts.pdf#page=15 for the expected format for the /CharStrings entries. In the referenced PDF document the /CharStrings are missing the expected end-token, which causes us to swallow the start of the next glyph name.	2022-01-17 18:55:22 +01:00
Tim van der Meij	922dac035c	Merge pull request #14448 from Snuffleupagus/Type3-circular-refs Prevent circular references in Type3 fonts	2022-01-15 14:11:47 +01:00
Jonas Jenwald	4c55563574	Add an additional test-case for circular references in Type3 fonts The PDF document in this patch already worked without the previous patch, but I wanted to improve our test-coverage for the Type3-parsing. The attached PDF document was also found in https://github.com/pdf-association/safedocs/tree/main/Miscellaneous%20Targeted%20Test%20PDFs	2022-01-13 17:59:57 +01:00
Jonas Jenwald	53d4ee7990	Prevent circular references in Type3 fonts In corrupt PDF documents Type3 fonts may introduce circular dependencies, thus resulting in the affected font(s) never loading and parsing/rendering never completing. Note that I've not seen any real-world examples of this kind of font corruption, but the attached PDF document was rather found in https://github.com/pdf-association/safedocs/tree/main/Miscellaneous%20Targeted%20Test%20PDFs Please note: That repository contains a number of reduced test-cases that are specifically intended to test interoperability (between PDF viewer) and parsing/rendering for various kinds of strange/corrupt PDF documents. Some of the test-cases found there may thus not make sense to try and "fix" upfront, in my opinion, unless the problems are also found in real-world PDF documents.	2022-01-13 17:58:37 +01:00
Jonas Jenwald	08d88a0235	Ignore Annotations with empty /Rect-entries in the display-layer (issue 14438) This prevents the `BaseSVGFactory.create`-method from throwing, and thus preventing any remaining Annotations (on the page) from rendering in corrupt documents.	2022-01-11 13:54:35 +01:00
Calixte Denizet	6cdae5ac4d	Use positive dimensions for text chunks in the text layer (issue #14415 ).	2022-01-05 10:49:56 +01:00
KouWakai	98158b67a3	Handle non-integer Annotation border widths correctly (issue 14203) The existing code appears to be wrong, since according to the PDF specification the border width of an Annotation only has to be a number and not specifically an integer. Please see: - https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=392 - https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G11.2096210 - https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G6.1965562	2021-12-24 22:10:19 +09:00
Jonas Jenwald	e8562173b8	Prevent an infinite loop when parsing corrupt /CCITTFaxDecode data (issue 14305) Fixes one of the documents in issue 14305.	2021-12-07 13:57:25 +01:00
Jonas Jenwald	909f012fb8	Add a (linked) test-case for issue 8022 Given that [bug 1336591](https://bugzilla.mozilla.org/show_bug.cgi?id=1336591) was just closed as fixed, thus fixing issue 8022 in Firefox, let's add a test-case to enable us to catch any future regressions either in PDF.js or in browsers themselves.	2021-12-06 15:27:40 +01:00
Tim van der Meij	dc455c836e	Merge pull request #14339 from Snuffleupagus/issue-8019-reftest Add a (linked) test-case for issue 8019	2021-12-04 13:26:47 +01:00
Jonas Jenwald	40291d1943	Handle errors when fetching the raw /Metadata (issue 14305) Currently the `Catalog.metadata` getter only handles errors during parsing, however in a corrupt PDF document fetching of the raw /Metadata can obviously fail as well. Without this patch the `PDFDocumentProxy.getMetadata` method, in the API, can thus fail which it never should and this will cause the viewer to not initialize all state as expected. Fixes one of the documents in issue 14305.	2021-12-04 09:41:42 +01:00
Jonas Jenwald	ca82e1832f	Add a (linked) test-case for issue 8019 Given that [bug 1336572](https://bugzilla.mozilla.org/show_bug.cgi?id=1336572) was just closed as fixed, thus fixing issue 8019 in Firefox[1], let's add a test-case to enable us to catch any future regressions either in PDF.js or in browsers themselves. --- [1] It also seems to be working in Google Chrome, although I'm having a slightly difficult time deciphering exactly what configurations were affected when looking through issue 8019.	2021-12-04 08:56:04 +01:00
Jonas Jenwald	1fac6371d3	[Regression] Eagerly fetch/parse the entire /Pages-tree in corrupt documents (issue 14303, PR 14311 follow-up) Please note: This is similar to the method that existed prior to PR 3848, but the new method will only be used as a fallback when parsing of corrupt PDF documents. The implementation in PR 14311 unfortunately turned out to be way too simplistic, as evident by the recently added test-files in issue 14303, since it may cause infinite loops in `PDFDocument.checkLastPage` for some corrupt PDF documents.[1] To avoid this, the easiest solution that I could come up with was to fallback to eagerly parsing the entire /Pages-tree when the /Count-entry validation fails during document initialization. Fixes at least two of the issues listed in issue 14303, namely the `poppler-395-0.pdf...` and `GHOSTSCRIPT-698804-1.pdf...` documents. --- [1] The whole point of PR 14311 was obviously to get rid of infinte loops during document initialization, not to introduce any more of those.	2021-12-02 14:31:04 +01:00
Jonas Jenwald	63be23f05b	Handle errors correctly when data lookup fails during /Pages-tree parsing (issue 14303) This only applies to severely corrupt documents, where it's possible that the `Parser` throws when we try to access e.g. a /Kids-entry in the /Pages-tree. Fixes two of the issues listed in issue 14303, namely the `poppler-742-0.pdf...` and `poppler-937-0.pdf...` documents.	2021-12-02 10:54:40 +01:00
Jonas Jenwald	a807ffe907	Prevent circular references in XRef tables from hanging the worker-thread (issue 14303) Please note: While this patch on its own is sufficient to prevent the worker-thread from hanging, however in combination with PR 14311 these PDF documents will both load and render correctly. Rather than focusing on the particular structure of these PDF documents, it seemed (at least to me) to make sense to try and prevent all circular references when fetching/looking-up data using the XRef table. To avoid a solution that required tracking the references manually everywhere, the implementation settled on here instead handles that internally in the `XRef.fetch`-method. This should work, since that method and the `Parser`/`Lexer`-implementations are completely synchronous. Note also that the existing `XRef`-caching, used for all data-types except Streams, should hopefully help to lessen the performance impact of these changes. One potential problem with these changes could be certain browser exceptions, since those are generally not catchable in JavaScript code, however those would most likely "stop" worker-thread parsing anyway (at least I hope so). Finally, note that I settled on returning dummy-data rather than throwing an exception. This was done to allow parsing, for the rest of the document, to continue such that one bad reference doesn't prevent an entire document from loading. Fixes two of the issues listed in issue 14303, namely the `poppler-91414-0.zip-2.gz-53.pdf` and `poppler-91414-0.zip-2.gz-54.pdf` documents.	2021-11-27 23:50:26 +01:00
Jonas Jenwald	d0c4bbd828	[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) This patch basically extends the approach from PR 10392, by also checking the last page. Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an integer /Count entry it must also be correct/valid. As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to breaking down elsewhere in the worker-thread (and hanging the browser). Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the entire /Pages-tree and essentially counting the pages. To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the entire /Pages-tree to determine the number of pages. Unfortunately these changes will have a number of somewhat negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug. - This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the last page of the PDF documents. - For poorly generated PDF documents, where the entire /Pages-tree only has one level, we'll unfortunately need to fetch/parse the entire /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of some long PDF documents, - This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost. As one small additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value). Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.	2021-11-27 21:57:35 +01:00
Calixte Denizet	31e13515f5	XFA - Draw arcs correctly - it aims to fix #14315; - take into account the startAngle to compute the coordinates of the final point.	2021-11-27 19:30:12 +01:00
Jonas Jenwald	ca8d2bdce4	Abort parsing when the XRef /W-array contain bogus entries (issue 14303) For this particular PDF document, we have `/W [1 2 166666666666666666666666666]` which obviously makes no sense. While this patch makes no attempt at actually validating the entries in the /W-array, we'll now simply abort all processing when the end of the PDF document has been reached (thus preventing hanging the browser). Please note that this patch doesn't enable the PDF document to be loaded/rendered, but at least it fails "correctly" now. Fixes one of the issues listed in issue 14303, namely the `REDHAT-1531897-0.pdf`document.	2021-11-25 18:35:08 +01:00
Jonas Jenwald	ae4f1ae3e7	Ensure that `ChunkedStream` won't attempt to request data beyond the document size (issue 14303) This bug was surprisingly difficult to track down, since it didn't just depend on range-requests being used but also on how quickly the document was loaded. To even be able to reproduce this locally, I had to use a very small `rangeChunkSize`-value (note the unit-test). The cause of this bug is a bogus entry in the XRef-table, causing us to attempt to request data from beyond the actual document size and thus getting into an infinite loop. Fixes one of the issues listed in issue 14303, namely the `PDFBOX-4352-0.pdf` document.	2021-11-24 19:19:43 +01:00
calixteman	85c6dd59ce	Merge pull request #14268 from calixteman/outline Remove non-displayable chars from outline title (#14267)	2021-11-13 08:12:56 -08:00
Calixte Denizet	7041c62ccf	Remove non-displayable chars from outline title (#14267 ) - it aims to fix #14267; - there is nothing about chars in range [0-1F] in the specs but acrobat doesn't display them in any way.	2021-11-13 16:56:08 +01:00
Jonas Jenwald	afcc99a86d	When parsing corrupt documents without any trailer-dictionary, fallback to the "top"-dictionary (issue 14269) There's obviously no guarantee that this will work in general, if the document is sufficiently corrupt, but it should hopefully be better than just throwing `InvalidPDFException` as currently happens. Please note that, as is often the case with corrupt documents, it's somewhat difficult to know if we're rendering the document "correctly" with this patch[1]. In this case even Adobe Reader cannot open the document, which is always a good sign that it's really corrupt, however we're at least able to render something with this patch. --- [1] Whatever "correct" even means when dealing with corrupt PDF documents, where often times different PDF viewers won't agree completely.	2021-11-13 13:21:38 +01:00
Jonas Jenwald	28fb3975eb	Merge pull request #14266 from calixteman/bug931481 Don't consider space as real space when there is an extra spacing (bug 931481)	2021-11-12 21:42:32 +01:00
Calixte Denizet	a88ff34eb7	Don't consider space as real space when there is an extra spacing (bug 931481) - it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=931481; - real space chars are pushed in the chunk but when there is an extra spacing, the next char position must be compared with the previous one; - for example, an extra spacing can cancel a space so visually there are no space.	2021-11-12 18:53:48 +01:00
Calixte Denizet	5b7e1f5232	XFA - Avoid an exception when looking for a font in a parent node - it aims to fix issue https://github.com/mozilla/pdf.js/issues/14150; - a parent can be null in case the root has been reached, so just add a check.	2021-11-12 16:27:08 +01:00
Jonas Jenwald	ea1c348c67	Always prefer abbreviated keys, over full ones, when doing any dictionary lookups (issue 14256) Note that issue 14256 was specifically about inline images, please refer to: - https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G7.1852045 - https://www.pdfa.org/safedocs-unearths-pdf-inline-image-issue/ - https://pdf-issues.pdfa.org/32000-2-2020/clause08.html#H8.9.7 However, during review of the initial PR in https://github.com/mozilla/pdf.js/pull/14257#issuecomment-964469710, it was suggested that we instead do this unconditionally for all dictionary lookups. In addition to re-ordering the existing call-sites in the `src/core`-code, and adding non-PRODUCTION/TESTING asserts to catch future errors, for consistency a number of existing `if`/`switch`-blocks were re-factored to also check the abbreviated keys first.	2021-11-10 11:56:18 +01:00
calixteman	4bb9de4b00	Merge pull request #14239 from calixteman/1739502 XFA - Fix a breakBefore issue when target is a contentArea and startNew is 1 (bug 1739502)	2021-11-08 03:14:42 -08:00
Calixte Denizet	a08763f4aa	XFA - Fix a breakBefore issue when target is a contentArea and startNew is 1 (bug 1739502) - it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1739502; - when the target area was the current content area, everything was pushed in it instead of creating a new one (and consequently a new pageArea is created). - the pdf shows an alignment issue on page 4: - the hAlign is "center" but the subform was the width of its parent, so compute the real width of the subform with tb layout; - there is an extra empty page at the end of the pdf: - there is a subform with some hidden elements which are not rendered for now (since there is no plugged JS engine it isn't possible to draw them in changing their visibility). - so in case a subform is empty and has no real dimensions (at least one is 0), we just consider it as empty.	2021-11-05 18:59:55 +01:00
calixteman	e136afbabc	Merge pull request #14218 from janekotovich/subform_min_0 XFA subform with occur min=0 and no bound data displaying.	2021-11-05 04:12:34 -07:00
Jonas Jenwald	8222d6530b	Merge pull request #14232 from brendandahl/show-text-pattern Use correct matrix for patterns with showText.	2021-11-05 10:04:56 +01:00
Brendan Dahl	1c7048399b	Use correct matrix for patterns with showText. We were incorrectly using the transform in the pattern before it had been adjusted causing the pattern to be misplaced relative to the page. Fixes: ShowText-ShadingPattern.pdf (already in corpus) Fixes: #8111 Fixes: #9243	2021-11-04 16:57:36 -07:00
Jane-Kotovich	56b502391c	XFA subform with occur min=0 and no bound data displaying Subfrom nomin displays even though it's subform is set to <occur max=-1 min=0> If we look through specs of XFA 3.3 : https://www.pdfa.org/norm-refs/XFA-3_3.pdf - The min attribute is used when processing a form that contains data. Regardless of the data at least this number of instances is included. It is permissible to set this value to zero, in which case the container is entirely excluded if there is no data for it. However, in our case it doesn't happen, because we let our empty dataNode get through. Though by setting a clause: - eliminate unmatched data with occur min=0 we are checking our empty data and sending it to uselessNode array where at the end it gets removed;	2021-11-04 20:22:05 +10:00
Jonas Jenwald	5f77d3719b	Tweak the Bidi-detection heuristics for very short RTL strings (issue 11656) Very short strings can narrowly miss the existing Bidi-detection threshold, leading to incorrect text-selection and copying behaviour. In my testing, neither Adobe Reader or PDFium seem to handle copying "correctly" for this document. Hence it's not entirely clear to me that we actually want to fix this, since tweaking these heuristics can obviously cause regressions elsewhere (and our test coverage for RTL-text isn't exactly great).	2021-11-03 20:31:57 +01:00

1 2 3 4 5 ...

1184 Commits