Currently the `isSpace` utility function is a member of `Lexer`, which seems suboptimal, given that it's placed in `core/parser.js`. In practice, this means that in a number of `core/*.js` files we thus have an *otherwise* completely unnecessary dependency on `core/parser.js` for a one-line function.
Instead, this patch moves `isSpace` into `shared/util.js` which seems more appropriate for this kind of utility function. Not to mention that since all the affected `core/*.js` files already depends on `shared/util.js`, this doesn't incur any more file dependencies.
As evident from e.g. PRs 6485 and 7118, some bad PDF generators unfortunately create Arrays where *some* elements are indirect objects (i.e. `Ref`s). This seems to mostly affect Arrays that contain numbers, such as e.g. `Matrix/FontMatrix/BBox/FontBBox/Rect/Color/...`, and has manifested itself in PDF files that fail to render correctly (some elements are missing).
The problem in both the cases above, besides broken rendering, was that there were *no* errors/warnings that indicated what the problem was, making it difficult to pinpoint the issue.
Hence this patch, where I've audited all usages of `Dict_get` in `src/core/` files, and replaced it with `Dict_getArray` where appropriate to try and prevent unnecessary future bugs.
DecodeStream currently initializes its |buffer| field to |null|, which
is reasonable, because lots of DecodeStreams never need to instantiate a
buffer. But this requires various special cases in the code.
This patch change it so DecodeStreamClosure has a single empty
Uint8Array which gets shared between all buffers upon initialization.
This avoids the special cases.
DecodeStream.prototype.ensureBuffer() is really hot, and this removes a
test from the fast path. For one 226 page scanned document this sped up
rendering by about 2%.
This is achieved by adding getBytes2() and getBytes4() to streams, and by
changing int16() and int32() to take multiple scalar args instead of an array
arg.
When decoding a stream, the decode buffer is often grown multiple times, its
byte size increasing like so: 512, 1024, 2048, etc. This patch estimates the
minimum size in advance (using the length of the encoded stream), often
allowing the smaller sizes to be skipped. It also renames numerous |length|
variables as |maybeLength| to make it clear that they can be |null|.
I measured this change on eight documents. This change reduces the cumulative
size of decode buffer allocations by 0--32%, with 10--20% being typical. This
reduces peak RSS by 10 or 20 MiB for several of them.
This avoids lots of unnecessary work when such streams are referred to via
fetch(), and so their bytes aren't subsequently read. This is a large
performance win on some files.
At the initialization of `Lexer_getObj` (in `parser.js`), there's a loop
that skips whitespace and breaks out whenever EOF is encountered.
(https://github.com/mozilla/pdf.js/blob/88ec2bd1a/src/core/parser.js#L586-L599)
Whenever the current character is not a whitespace character,
`ch = this.nextChar();` is used to find the next character
(using `return this.currentChar = this.stream.getByte())`).
The aforementioned `getByte` method retrieves the next byte using
(https://github.com/mozilla/pdf.js/blob/88ec2bd1a/src/core/stream.js#L122-L128)
var pos = this.pos;
while (this.bufferLength <= pos) {
if (this.eof)
return -1;
this.readBlock();
}
return this.buffer[this.pos++];
This piece of code relies on this.eof to detect whether the last character
has been read. When the stream is a `FlateStream`, and the end of the stream
has been reached, then **`this.eof` is not set to `true`**, because this check
is done inside a loop that does not occur when the read block size is zero:
(https://github.com/mozilla/pdf.js/blob/88ec2bd1ac/src/core/stream.js#L511-L517)
for (var n = bufferLength; n < end; ++n) {
if (typeof (b = bytes[bytesPos++]) == 'undefined') {
this.eof = true;
break;
}
buffer[n] = b;
}
This commit fixes the issue by setting this.eof to true whenever the loop is not
going to run (i.e. when bufferLength === end, i.e. blockLen === 0).