Looking at this again, it struck me that the added functionality in `Util.intersect` is probably more confusing than helpful in general; sorry about the churn in this code!
Based on the parameter name you'd probably expect it to only match when the intersection is `[0, 0, 0, 0]`, and not when just one component is zero; the `skipEmpty` parameter thus feels too tightly coupled to the `Page.view` getter.
This is based on a real-world PDF file I encountered very recently[1], although I'm currently unable to recall where I saw it.
Note that different PDF viewers handle these sorts of errors differently, with Adobe Reader outright failing to render the attached PDF file whereas PDFium mostly handles it "correctly".
The patch makes the following notable changes:
- Refactor the `cropBox` and `mediaBox` getters, on the `Page`, to reduce unnecessary duplication. (This will also help in the future, if support for extracting additional page bounding boxes is added to the API.)
- Ensure that the page bounding boxes, i.e. `cropBox` and `mediaBox`, are never empty to prevent issues/weirdness in the viewer.
- Ensure that the `view` getter on the `Page` will never return an empty intersection of the `cropBox` and `mediaBox`.
- Add an *optional* parameter to `Util.intersect`, to allow checking that the computed intersection isn't actually empty.
- Change `Util.intersect` to have consistent return types, since Arrays are of type `Object` and falling back to returning a `Boolean` thus seems strange; see the sketch after this list.
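For reference, a rough sketch of the intended behaviour is shown below; the parameter name and exact structure are only illustrative, and the real code in `src/shared/util.js` may differ.
```javascript
// Rough sketch only; the actual `Util.intersect` implementation may differ,
// and the `skipEmpty` parameter name is merely illustrative.
function intersect(rect1, rect2, skipEmpty = false) {
  const xLow = Math.max(Math.min(rect1[0], rect1[2]), Math.min(rect2[0], rect2[2]));
  const xHigh = Math.min(Math.max(rect1[0], rect1[2]), Math.max(rect2[0], rect2[2]));
  if (xLow > xHigh) {
    return null; // Always return `null`, rather than a `Boolean`, on failure.
  }
  const yLow = Math.max(Math.min(rect1[1], rect1[3]), Math.min(rect2[1], rect2[3]));
  const yHigh = Math.min(Math.max(rect1[1], rect1[3]), Math.max(rect2[1], rect2[3]));
  if (yLow > yHigh) {
    return null;
  }
  if (skipEmpty && (xLow === xHigh || yLow === yHigh)) {
    return null; // Optionally treat a zero-width/height intersection as empty.
  }
  return [xLow, yLow, xHigh, yHigh];
}
```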
---
[1] In that case I believe that only the `cropBox` was empty, but it seemed like a good idea to attempt to fix a bunch of related cases all at once.
Firefox telemetry now supports using string labels. Convert the integers that we used for categories to just use strings.
The upstream work will happen in:
https://bugzilla.mozilla.org/show_bug.cgi?id=1566882
For Type3 fonts text-selection is often not that great, and there are a couple of heuristics used to try and improve things. This patch simply extends those heuristics a bit, and fixes a pre-existing "naive" array comparison, but this all feels a bit brittle to say the least.
The existing Type3 test-coverage isn't that great in general, and in particular Type3 `text` tests are few and far between, which is why this patch adds *two* different new `text` tests.
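For context, the "naive" comparison is replaced with a strict element-by-element check, conceptually like the sketch below; the helper name used here is only illustrative.
```javascript
// Strict element-by-element array comparison, rather than e.g. comparing
// `arr1.toString() === arr2.toString()`; the helper name is illustrative.
function isArrayEqual(arr1, arr2) {
  if (arr1.length !== arr2.length) {
    return false;
  }
  return arr1.every((element, index) => element === arr2[index]);
}
```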
Given that the function is (purposely) independent of the verbosity level and that its message is worded to only apply on the main-thread, there's no reason to duplicate this across the built `pdf.js`/`pdf.worker.js` files.
The `src/shared/util.js` file is being bundled into both the `pdf.js` and `pdf.worker.js` files, meaning that its code is by definition duplicated.
Some main-thread only utility functions have already been moved to a separate `src/display/display_utils.js` file, and this patch simply extends that concept to utility functions which are used *only* on the worker-thread.
Note in particular the `getInheritableProperty` function, which expects a `Dict` as input and thus *cannot* possibly ever be used on the main-thread.
For PDF documents with sufficiently broken XRef tables, it's usually quite obvious when you need to fallback to indexing the entire file. However, for certain kinds of corrupted PDF documents the XRef table will, for all intents and purposes, appear to be valid. It's not until you actually try to fetch various objects that things will start to break, which is the case in the referenced issues[1].
Since there's generally a real effort being made in PDF.js to load even corrupt PDF documents, this patch contains a suggested approach to attempt to do a bit more validation of the XRef table during the initial document loading phase.
Here the choice is made to attempt to load the *first* page, as a basic sanity check of the validity of the XRef table. Please note that attempting to load a more-or-less arbitrarily chosen object, without any context of what it's supposed to be, isn't very useful, which is why this particular choice was made.
Obviously, just because the first page can be loaded successfully doesn't guarantee that the *entire* XRef table is valid; however, if even the first page fails to load you can be reasonably sure that the document is *not* valid[2].
Even though this patch won't cause any significant increase in the amount of parsing required during initial loading of the document[3], it will require loading more data upfront, which thus delays the initial `getDocument` call.
Whether or not this is a problem depends very much on what you actually measure; please consider the following examples:
```javascript
console.time('first');
getDocument(...).promise.then((pdfDocument) => {
  console.timeEnd('first');
});

console.time('second');
getDocument(...).promise.then((pdfDocument) => {
  pdfDocument.getPage(1).then((pdfPage) => { // Note: the API uses `pageNumber >= 1`, the Worker uses `pageIndex >= 0`.
    console.timeEnd('second');
  });
});
```
The first case is pretty much guaranteed to show a small regression, however the second case won't be affected at all since the Worker caches the result of `getPage` calls; see the sketch below. Again, please remember that the second case is what matters for the standard PDF.js use-case, which is why I'm hoping that this patch is deemed acceptable.
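Conceptually, that caching amounts to memoizing the per-page promise, roughly as sketched here; the names are illustrative and not the actual internals.
```javascript
// Conceptual sketch of how repeated `getPage` calls can be made cheap by
// caching the promise; names are illustrative, not the actual internals.
const pagePromiseCache = new Map();

function getPageCached(transport, pageNumber) {
  if (!pagePromiseCache.has(pageNumber)) {
    pagePromiseCache.set(pageNumber, transport.getPage(pageNumber));
  }
  // Subsequent calls re-use the same (pending or resolved) promise.
  return pagePromiseCache.get(pageNumber);
}
```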
---
[1] In issue 7496, the problem is that the document is edited without the XRef table being correctly updated.
In issue 10326, the generator was sorting the XRef table according to the offsets rather than the objects.
[2] The idea of checking the first page in particular came from the "standard" use-case for the PDF.js library, i.e. the default viewer, where a failure to load the first page basically means that nothing will work; note how `{BaseViewer, PDFThumbnailViewer}.setDocument` depends completely on being able to fetch the *first* page.
[3] The only extra parsing is caused by, potentially, having to traverse *part* of the `Pages` tree to find the first page.
Compared to all the other (static) methods in `Util`, the `toRoman` one looks slightly out of place. Even more so considering that `Util` is being exposed through `pdfjsLib`, where access to a Roman numerals conversion method doesn't make much sense.
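For context, the helper in question converts positive integers to Roman numerals (used e.g. for page labels); a simplified sketch of such a conversion, which may differ in details from the moved code:
```javascript
// Simplified Roman-numerals conversion, for positive integers only.
function toRoman(number, lowerCase = false) {
  const digits = [
    ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"], // ones
    ["", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"], // tens
    ["", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"], // hundreds
  ];
  let roman = "M".repeat(Math.floor(number / 1000));
  roman += digits[2][Math.floor(number / 100) % 10];
  roman += digits[1][Math.floor(number / 10) % 10];
  roman += digits[0][number % 10];
  return lowerCase ? roman.toLowerCase() : roman;
}

console.log(toRoman(1994));       // "MCMXCIV"
console.log(toRoman(1994, true)); // "mcmxciv"
```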
This moves/exposes the `URL` polyfill similarly to the existing `ReadableStream` polyfill, rather than exposing it globally, to avoid interfering with any "outside" code.
Both the `URL` and `ReadableStream` polyfills are now exposed on the `pdfjsLib` object, such that they are accessible to the viewer components.
Furthermore, the `no-restricted-globals` ESLint rule is also enabled to prevent accidental usage of the native `URL`/`ReadableStream` implementations directly in the `src/` and `web/` folders; see also https://eslint.org/docs/rules/no-restricted-globals
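Roughly, the rule is configured along the lines of the following `.eslintrc.js`-style sketch; the message strings, and the exact file format used, are assumptions here rather than the actual configuration.
```javascript
// Sketch of a no-restricted-globals configuration; the real ESLint config
// files in the repository may differ in format and wording.
module.exports = {
  rules: {
    "no-restricted-globals": [
      "error",
      { name: "URL", message: "Use the polyfill exposed on pdfjsLib instead." },
      { name: "ReadableStream", message: "Use the polyfill exposed on pdfjsLib instead." },
    ],
  },
};
```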
Addresses the remaining TODO in https://github.com/mozilla/pdf.js/projects/6
Not only is the `Util.loadScript` helper function unused on the Worker side, but even trying to use it there would throw an Error (since `document` isn't defined/available in Workers).
Hence this helper function is moved, and its code modernized slightly by having it return a Promise rather than needing a callback function.
Finally, to reduce code duplication, the "new" `loadScript` function is exported and used in the viewer.
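A minimal sketch of the Promise-returning variant; the actual implementation may handle errors and cleanup somewhat differently.
```javascript
// Minimal Promise-based script loader; illustrative only.
function loadScript(src) {
  return new Promise((resolve, reject) => {
    const script = document.createElement("script");
    script.src = src;
    script.onload = resolve;
    script.onerror = () => {
      reject(new Error(`Cannot load script at: ${script.src}`));
    };
    (document.head || document.documentElement).appendChild(script);
  });
}
```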
The `MessageHandler` itself, and its assorted helper functions, are currently the single largest[1] piece of code in the `src/shared/util.js` file. By moving this code into its own file, `src/shared/util.js` thus becomes smaller and more manageable.
This function combines the logic of two separate methods into one.
The loop limit is also a good thing to have for the calls in
`src/core/annotation.js`.
Moreover, since this is important functionality, a set of unit tests and documentation are added.
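Conceptually, the combined helper walks up the /Parent chain while guarding against reference loops; a rough sketch follows, where the loop-limit value is arbitrary and the real helper supports additional options.
```javascript
// Rough sketch of an inheritable-property lookup with a loop limit; the
// actual helper supports more options, and the limit here is arbitrary.
const LOOP_LIMIT = 100;

function getInheritableProperty(dict, key) {
  let loopCount = 0;
  while (dict) {
    if (++loopCount > LOOP_LIMIT) {
      break; // Prevent infinite loops in (corrupt) documents.
    }
    const value = dict.get(key);
    if (value !== undefined) {
      return value;
    }
    dict = dict.get("Parent"); // Continue searching in the parent dictionary.
  }
  return undefined;
}
```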
Since `loadJpegStream` is only used at a *single* spot in the code-base, and given that it's very heavily tailored to the calling code (since it relies on the data structure of `PDFObjects`), this patch simply inlines the code in `src/display/api.js` instead.
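Conceptually, the inlined code just loads the JPEG via an `Image` element and resolves the corresponding `PDFObjects` entry once the data is available; the sketch below is illustrative and the names are not the actual code.
```javascript
// Illustrative sketch of the inlined JPEG-loading logic; names and exact
// structure differ from the real code in `src/display/api.js`.
function loadJpegImage(id, imageUrl, objs) {
  const img = new Image();
  img.onload = function () {
    objs.resolve(id, img); // Resolve the corresponding `PDFObjects` entry.
  };
  img.onerror = function () {
    objs.resolve(id, null);
    console.warn(`Error during JPEG image loading: ${id}`);
  };
  img.src = imageUrl;
}
```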
These were removed in PR 9170, since they were unused in the browsers that we'll support in PDF.js version `2.0`.
However, looking at the output of Travis, where a subset of the unit-tests are run using Node.js, there are warnings about `btoa` being undefined. This doesn't appear to cause any errors, which probably explains why we didn't notice this before (despite PR 9201).
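In Node.js, `btoa`/`atob` can be shimmed with `Buffer`, roughly as follows; this is just one possible shim, not necessarily what the code-base ends up using.
```javascript
// Possible Node.js shims for `btoa`/`atob`, based on `Buffer`.
if (typeof btoa === "undefined") {
  global.btoa = (chars) => Buffer.from(chars, "binary").toString("base64");
}
if (typeof atob === "undefined") {
  global.atob = (input) => Buffer.from(input, "base64").toString("binary");
}
```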
It is quite confusing that the custom function is called `log2` while it
actually returns the ceiling value and handles zero and negative values
differently than the native function.
To resolve this, we add a comment that explains these differences and
make the function use the native `Math` functions internally instead of
using our own custom logic. To verify that the function does what we
expect, we add unit tests.
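In other words, after this change the helper behaves roughly like the sketch below (a minimal sketch, not necessarily the exact code that landed):
```javascript
// The custom helper returns the *ceiling* of the base-2 logarithm, and it
// returns 0 for zero/negative input, whereas the native `Math.log2` would
// return -Infinity or NaN for such values.
function log2(x) {
  if (x <= 0) {
    return 0;
  }
  return Math.ceil(Math.log2(x));
}

console.log(log2(5));      // 3, since 2^3 = 8 is the smallest power of two >= 5
console.log(Math.log2(5)); // 2.3219...
console.log(log2(0));      // 0, whereas Math.log2(0) === -Infinity
```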
All browsers except for IE have supported `Math.log2` for quite a long time already (see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/log2).
For IE, we use the core-js polyfill.
According to the microbenchmark at https://jsperf.com/log2-pdfjs/1,
using the native functions should also be faster, in my testing almost
six times as fast.
`__pdfjsdev_webpack__` was used to skip evaluating part of an AST,
in order to not mangle some `require` symbols.
This commit removes `__pdfjsdev_webpack__`, and:
- Uses `__non_webpack_require__` when one wants the output to
contain `require` instead of `__webpack_require__`.
- Adds options to the webpack config to prevent "polyfills" for
  some Node.js-specific APIs from being added (see the sketch after this list).
- Uses `// eslint-disable-next-line no-undef` instead of `/* globals ... */`
  for variables that are not meant to be used globally.
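For context, the webpack options in question are along the lines of the following sketch; the exact set of disabled "polyfills" here is an assumption rather than the actual configuration.
```javascript
// Sketch of the relevant part of a webpack configuration; the particular
// entries disabled here are assumptions, not the actual config.
module.exports = {
  // ...
  node: {
    // Prevent webpack from injecting mocks/polyfills for these
    // Node.js-specific globals and modules into the built bundles.
    process: false,
    Buffer: false,
    fs: "empty",
  },
};
```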
This patch adds Streams API support in getTextContent, so that we can stream data in chunks instead of fetching all the data at once from the worker thread to the main thread. It does so without changing the core functionality of getTextContent.
The textContent is enqueued directly in getTextContent in the partialEvaluator.
Adds functionality to accept a Queueing Strategy in the sendWithStream method. Using the Queueing Strategy we can control the data that is enqueued into the sink, and hence regulate the flow of chunks from the worker to the main thread.
Adds capability in the pull and cancel methods, and the ready and desiredSize properties, to the streamSink.
Adds unit tests for ReadableStream and sendWithStream.
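On the main thread, the streamed chunks can then be consumed with a standard reader loop, roughly as follows; this assumes a `streamTextContent`-style method returning a `ReadableStream`, which may differ from the actual public API.
```javascript
// Rough sketch of consuming streamed text-content chunks on the main thread;
// the `streamTextContent` method and chunk shape are assumptions here.
function readTextContent(pdfPage) {
  const reader = pdfPage.streamTextContent().getReader();
  const textItems = [];

  return reader.read().then(function processChunk({ value, done }) {
    if (done) {
      return textItems; // All chunks have been received.
    }
    textItems.push(...value.items); // Each chunk contains a subset of the items.
    return reader.read().then(processChunk);
  });
}
```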
http://eslint.org/docs/rules/comma-dangle
http://eslint.org/docs/rules/object-curly-spacing
Given that we currently have quite inconsistent object formatting, fixing this in *one* big patch probably wouldn't be feasible (since I cannot imagine anyone wanting to review that); hence I've opted to try and do this piecewise instead.
Please note: This patch was created automatically, using the ESLint `--fix` command line option. In a couple of places this caused lines to become too long, and I've fixed those manually; please refer to the interdiff below for the only hand-edits in this patch.
```diff
diff --git a/src/display/canvas.js b/src/display/canvas.js
index 5739f6f2..4216b2d2 100644
--- a/src/display/canvas.js
+++ b/src/display/canvas.js
@@ -2071,7 +2071,7 @@ var CanvasGraphics = (function CanvasGraphicsClosure() {
var map = [];
for (var i = 0, ii = positions.length; i < ii; i += 2) {
map.push({ transform: [scaleX, 0, 0, scaleY, positions[i],
- positions[i + 1]], x: 0, y: 0, w: width, h: height, });
+ positions[i + 1]], x: 0, y: 0, w: width, h: height, });
}
this.paintInlineImageXObjectGroup(imgData, map);
},
diff --git a/src/display/svg.js b/src/display/svg.js
index 9eb05dfa..2aa21482 100644
--- a/src/display/svg.js
+++ b/src/display/svg.js
@@ -458,7 +458,11 @@ SVGGraphics = (function SVGGraphicsClosure() {
for (var x = 0; x < fnArrayLen; x++) {
var fnId = fnArray[x];
- opList.push({ 'fnId': fnId, 'fn': REVOPS[fnId], 'args': argsArray[x], });
+ opList.push({
+ 'fnId': fnId,
+ 'fn': REVOPS[fnId],
+ 'args': argsArray[x],
+ });
}
return opListToTree(opList);
},
```
Added test for ReadableStream.
Adds the ref-implementation license-header in streams-lib, and changes the gulp task to copy external/streams/ into the build/external/streams/ and build/dist/external/streams/ folders.
Adds README.md and LICENSE.md.
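The copying itself is conceptually just a small gulp task along these lines; the task name and the exact destination paths are illustrative, not the actual gulpfile contents.
```javascript
// Illustrative gulp task for copying the streams reference implementation;
// the task name and destination paths are assumptions.
var gulp = require("gulp");

gulp.task("copy-streams-lib", function () {
  return gulp
    .src("external/streams/**/*")
    .pipe(gulp.dest("build/external/streams/"))
    .pipe(gulp.dest("build/dist/external/streams/"));
});
```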