I received multiple reports about the following cryptic error in the
Chrome extension when the user tried to open a local file:
> PDF.js v1.1.527 (build: 2096a2a)
> Message: Cannot read property 'Symbol(Symbol.iterator)' of null
This error most likely originated from core/stream.js:
function Stream(arrayBuffer, start, length, dict) {
  this.bytes = (arrayBuffer instanceof Uint8Array ?
    arrayBuffer : new Uint8Array(arrayBuffer));
    ^^^^^^^^^^^
`arrayBuffer` is `null`, and that in turn is caused by the fact that
for non-existing files, there is no data. I've applied two fixes:
1. Never call onDone with a void buffer, but call the error handler
instead.
2. Show a sensible error message for local files with status = 0.
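A minimal sketch of the intended flow (the request/onDone/onError names are hypothetical, not the actual extension code):

request.onload = function () {
  var buffer = request.response;
  if (!buffer) {
    if (request.status === 0) {
      // Fix 2: local files that do not exist report status = 0.
      onError('File not found or not readable: ' + url);
    } else {
      onError('Empty response (HTTP status ' + request.status + ').');
    }
    return;
  }
  // Fix 1: onDone is now guaranteed a non-void buffer.
  onDone(buffer);
};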
In the `RenderPageRequest` handler in `worker.js`, we attempt to print an `info` message containing the rendering time and the length of the operator list. The latter is currently broken (and has been for quite some time), since the `length` of an `OperatorList` is reset when flushing occurs.
This patch attempts to rectify this, by adding a getter which keeps track of the total length.
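A minimal, self-contained sketch of the idea (the real OperatorList carries more state):

function OperatorList() {
  this.fnArray = [];
  this.argsArray = [];
  this._flushedLength = 0;  // ops already sent to the main thread
}
Object.defineProperty(OperatorList.prototype, 'totalLength', {
  get: function () {
    // Survives flushing: previously flushed ops plus the buffered ones.
    return this._flushedLength + this.fnArray.length;
  }
});
OperatorList.prototype.flush = function () {
  this._flushedLength += this.fnArray.length;
  // ... postMessage the current chunk, then reset the buffers:
  this.fnArray.length = 0;
  this.argsArray.length = 0;
};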
`operatorList.addOp` adds the arguments to the list which is then
passed as-is by postMessage to the main thread. But since we don't
parse these operations, they are raw PDF objects and may therefore
cause a serialization error.
This is a conservative patch, and only affects operators which are
known to be unsupported. We should ignore all unknown operators,
but I haven't really looked into the consequences of doing that.
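For illustration, a sketch of the conservative filtering (the operator set shown is illustrative, not the exact list from the patch):

// Drop the raw, unparsed arguments of operators that we already know are
// unsupported, so postMessage never sees unserializable PDF objects.
var KNOWN_UNSUPPORTED = Object.create(null);
KNOWN_UNSUPPORTED[OPS.beginMarkedContentProps] = true;  // illustrative

function addOpSafely(operatorList, fn, args) {
  operatorList.addOp(fn, KNOWN_UNSUPPORTED[fn] ? null : args);
}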
Fixes #6549.
In PR 6485 I somehow failed to account for the case where `xref` is undefined. Since a dictionary can be initialized without providing a reference to an `xref` instance, `Dict_getArray` can thus fail without this added check.
According to section 5.3.2 of the PDF specification, a positive value means,
in horizontal writing mode, that the next glyph is placed further to the left
(so the glyph is narrower), and, in vertical writing mode, that it is placed
further down (so the glyph is wider).
This change fixes the way PDF.js has interpreted the value.
This patch adjusts `get fingerprint` to also check that the `/ID` entry contains (non-empty) strings, to prevent more possible failures when loading corrupt PDF files (follow-up to PR 5602).
Note that I've not actually encountered such a PDF file in the wild. However given that `stringToBytes` will assert that the input is a string, and that we'll thus fail to load a document unless `get fingerprint` succeeds, making this more robust seems like a good idea to me.
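A sketch of the strengthened validation (calculateMD5 exists in pdf.js; the other names are approximations):

function computeFingerprint(trailerDict, stream) {
  var idArray = trailerDict.get('ID');
  var fileId;
  if (Array.isArray(idArray) &&
      typeof idArray[0] === 'string' && idArray[0].length > 0) {
    fileId = idArray[0];  // only trust a non-empty string /ID entry
  } else {
    // Otherwise hash the start of the file instead of failing outright.
    fileId = calculateMD5(stream.bytes, 0, FINGERPRINT_FIRST_BYTES);
  }
  return fileId;
}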
For (1, 0) cmaps, we have two different codepaths depending on whether the font has/hasn't got an encoding. But with (3, 1) cmaps we don't have a good fallback when the encoding is missing, hence this patch changes `readCmapTable` to only choose a (3, 1) cmap table if the font is non-symbolic *and* an encoding exists. Without this, we'll not be able to successfully create a working glyph map for some TrueType fonts with (3, 1) cmap tables.
Fixes 6410.
This code was added in PR 1214, but was made obsolete by PRs 1488/1493. Prior to the latter ones, `Dict_get` returned the raw objects. However, afterwards (and currently) `Dict_get` resolves indirect objects, which makes `Parser_fetchIfRef` redundant.
*Potential risks with this patch:*
This patch passes all tests locally, but there's a *small* possibility that it could break some weird PDF files.
In the current code, wrapping `Dict_get` inside `Parser_fetchIfRef` will potentially mean two back-to-back calls of `XRef_fetch`, if a reference points directly to another reference. I'm not sure if this can actually happen in practice, and I'd think that if it could, we'd already have run into it elsewhere in the code-base, given that `Parser` is the only place where we try to "double" resolve references.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1200096.
The problematic font has a `format 2` cmap, which we've never supported properly. Prior to PR 2606, we were able to fall back to a working state, despite not having proper support for that cmap format.
Obviously the best/correct solution would be to implement actual support for more cmap formats[1]. However, I'm hoping that a simple patch will be OK for now, given that:
- `format 2` cmaps seem to be quite rare in practice, since this has been broken for 2.5 years before anyone noticed.
- Having a simple patch will make potential uplifts a lot easier.
[1] See the specification at https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
Having a warning here would have meant that issue 6360 could have been solved in approximately five minutes, instead of an hour. To avoid that happening again, this patch adds a warning whenever we treat a stream as empty.
This patch improves the detection of `xref` in files where it is followed by an arbitrary whitespace character (not just a line-breaking char).
It also adds a check for missing whitespace, e.g. `1 0 obj<<`, to speed up `readToken` for the PDF file in the referenced issue.
Finally, the patch also replaces a bunch of magic numbers with suitably named constants.
Fixes 5752.
Also improves 6243, but there are still issues.
The problem with the PDF files in the issue, besides the obviously broken XRef tables which we're able to recover from, is that many/most of the streams have Dictionaries where the `Length` entry is set to `0`. This causes us to return `NullStream`, instead of the appropriate one in `Parser_makeFilter`.
Fixes 6360.
In some cases, such as when a CSP header is in use, constructing a function
from a string of JavaScript is not allowed. However, compiling the various
commands that need to be run on the canvas element is faster than interpreting them.
This patch changes the font renderer to instead emit commands that are compiled
by the font loader. If, during compilation, we receive an EvalError, we instead
interpret them.
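Roughly, the fallback looks like this (a sketch, not the exact font loader code; `interpret` stands in for the interpreter path):

function compileGlyph(cmds) {
  var body = cmds.map(function (cmd) { return cmd.js; }).join('\n');
  try {
    // Fast path: compile the emitted commands into a real function.
    return new Function('c', 'size', body);
  } catch (e) {
    // Under a CSP that forbids eval, `new Function` throws an EvalError;
    // fall back to interpreting the same command list.
    return function (c, size) {
      interpret(cmds, c, size);
    };
  }
}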
CMaps may be sparse. Array.prototype.forEach is terribly slow in Chrome
(and also in Firefox) when the sparse array contains a key with a high
value. E.g.
console.time('forEach sparse');
var a = [];
a[0xFFFFFF] = 1;
a.forEach(function () {});
console.timeEnd('forEach sparse');
// Chrome: 2890ms
// Firefox: 1345ms
Switching to CMap.prototype.forEach, which is optimized for such
scenarios, fixes the problem.
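For reference, a standalone sketch of sparse-friendly iteration in the spirit of CMap.prototype.forEach:

function forEachMapping(map, callback) {
  // `map` is a dense array for small CMaps, but can be very sparse.
  var length = map.length;
  if (length <= 0x10000) {
    for (var i = 0; i < length; i++) {
      if (map[i] !== undefined) {
        callback(i, map[i]);
      }
    }
  } else {
    // Enumerate only the keys that actually exist, instead of scanning
    // every index up to e.g. 0xFFFFFF.
    for (var key in map) {
      callback(key | 0, map[key]);
    }
  }
}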
pi is an index in the stream and is explained on page 201 of the 32000-spec (however 1-based there), and ps is an index to something in PDF.js. I used the code from flag 0 (which works) to understand which is which. It is also important to understand that for flags 1, 2 and 3, the stream is always assigned to the same coordinates and colors. What changes is which "old" coordinates and colors are assigned to what is "missing" in the stream. This is why for these flags, the code is identical except for the assignments in the first "row". (Same principle as in #6304). Note that this change will not improve the lamp_cairo.pdf file, only the two files mentioned in #6305.
Short story: somebody got lost in two different indices. pi is an index in the stream and is explained on page 198 of the 32000-spec (however 1-based there), and ps is an index to something in PDF.js. I used the code from flag 0 (which works) to understand which is which. It is also important to understand that for flags 1, 2 and 3, the stream is always assigned to the same coordinates and colors. What changes is which "old" coordinates and colors are assigned to what is "missing" in the stream. This is why for these flags, the code is identical except for the assignments in the first "row".
Currently, `src/core/core.js` uses the `fromRef` method on an `Annotation` object to obtain the right annotation type object (such as `LinkAnnotation` or `TextAnnotation`). That method in turn uses a method `getConstructor` to find out which annotation type object must be returned.
Aside from the fact that there is currently a lot of code to achieve this, these methods should not be part of the base `Annotation` class at all. Creation of annotation objects should be done by a factory (as also recommended by @yurydelendik at https://github.com/mozilla/pdf.js/pull/5218#issuecomment-52779659) that handles finding out the correct annotation type object and returning it. This patch implements this separation of concerns.
Doing this allows us to also simplify the code quite a bit and to make it more readable. Additionally, we are now able to get rid of the hardcoded array of supported annotation types. The factory takes care of checking the annotation types and falls back to returning the base annotation type (and issuing a warning, which the current code also does not do well) when an annotation type is unsupported.
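A condensed sketch of the factory approach (simplified from the actual patch):

var AnnotationFactory = {
  create: function (xref, ref) {
    var dict = xref.fetchIfRef(ref);
    if (!isDict(dict)) {
      return null;
    }
    var subtype = dict.get('Subtype');
    var type = isName(subtype) ? subtype.name : '';
    var parameters = { dict: dict, ref: ref };
    switch (type) {
      case 'Link':
        return new LinkAnnotation(parameters);
      case 'Text':
        return new TextAnnotation(parameters);
      default:
        warn('Unimplemented annotation type "' + type + '", ' +
             'falling back to the base annotation.');
        return new Annotation(parameters);
    }
  }
};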
I have manually tested this commit with 20 test PDFs with different annotation types, such as /Link, /Text, /Widget, /FileAttachment and /FreeText. All render identically before and after the patch, and unsupported annotation types are now properly indicated with a warning in the console.
This patch refactors the code responsible for setting the annotation's rectangle. Its goal is to:
- Check that the input is actually an array, and if so, that it contains exactly four elements.
- Only call `normalizeRect` if the input array is valid, i.e., we do not call it for the default rectangle anymore.
Unit tests are provided just like with the other patches in this series.
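The resulting setter, roughly (a sketch, assuming pdf.js's isArray and Util.normalizeRect helpers):

Annotation.prototype.setRectangle = function (rectangle) {
  // Only normalize when the input really is a four-element array;
  // otherwise fall back to the default rectangle untouched.
  if (isArray(rectangle) && rectangle.length === 4) {
    this.rectangle = Util.normalizeRect(rectangle);
  } else {
    this.rectangle = [0, 0, 0, 0];
  }
};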
Fixes #6106.
To avoid future regressions, two new unit tests were added:
1. A new PDF based on the report from #6106, which contains an
OpenAction of type JavaScript and a string "this.print({...}".
2. An existing PDF from https://bugzil.la/1001080 (from #4698).
Although it does not matter, since we don't execute the JavaScript code,
I have also changed "print(true)" to "print({})" since the print method
takes an object (not a boolean). See "Printing PDF documents", page 62:
http://adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/js_developer_guide.pdf
Basic mathematics would suggest that a double negative should always become positive, but it appears that Adobe Reader simply ignores that case. Hence I think that it makes sense for us to do the same.
Fixes 6218.
When the parser finds a stream, it retrieves the Length from the stream
dictionary and advances the lexer to the offset as specified in Length.
If this Length is incorrect, the lexer could end up anywhere.
When the lexer gets in an invalid state, it could throw errors. For
example, in issue 6108, the lexer ends up inside the stream data. This
stream has the ASCIIHexDecode filter, so all characters are made up from
ASCII characters, and the lexer interprets it as a command token. Tokens
cannot be longer than 127 bytes, so eventually 128 bytes are consumed
and the lexer throws a "Command token too long" error.
Another possible error is "Illegal character: 41" when the lexer happens
to end up at a ')' due to the length mismatch.
These problems are solved by catching lexer errors and recovering the
parser via the existing stream length detection branch.
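The shape of the fix, sketched as a hypothetical helper around the existing recovery branch:

function tryReadEndstream(parser) {
  // A bad /Length leaves the lexer inside the stream data, where reading
  // a token can throw; treat that as "Length was wrong".
  try {
    parser.shift();  // should land on the "endstream" keyword
    return isCmd(parser.buf1, 'endstream');
  } catch (e) {
    if (e instanceof MissingDataException) {
      throw e;  // genuine data gaps must still propagate
    }
    return false;  // recover via the stream length detection branch
  }
}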
Xref offsets are relative to the start of the PDF data, not to the start
of the PDF file. This is clear if you look at the other code:
- In XRef's readXRefTable and processXRefTable methods, the
offset of an xref entry is set to the bytes as given by a PDF file.
These values are always relative to the start of the PDF file (%PDF-).
- The XRef's readXRef method adds the start offset of the stream to
Xref entry's offset: "stream.pos = startXRef + stream.start".
Clearly, this line assumes that the entry offset excludes the start
offset.
However, when the PDF is parsed in recovery mode, the xref table is
filled with entries whose offset is relative to the start of the stream
rather than the PDF file. This is incorrect, and the fix is to subtract
the start offset of the stream from the entry's byte offset.
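The core of the fix, sketched as a hypothetical wrapper around the changed assignment:

function recordRecoveredEntry(entries, num, position, stream) {
  // `position` is measured from the start of the stream buffer, which may
  // contain junk before the %PDF- header; store it relative to the start
  // of the PDF data, as the rest of XRef expects.
  entries[num] = {
    offset: position - stream.start,
    gen: 0,
    uncompressed: true
  };
}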
The manually created PDF file serves as a regression test. It is a valid
PDF, except:
- The integer pointing to the start of the xref table and the %%EOF
trailer are missing. This will activate recovery mode in PDF.js.
- Some junk was added before the start of the PDF file. This exposes the
bad offset bug.
The PDF specification (cited below) specifies a maximum length of a name
in bytes as a minimal architectural limit. This means that PDF *writers*
should not create names that exceed 127 bytes.
It does not forbid PDF *readers* to accept such names though. These
names are only used internally to link PDF objects to other objects. For
these use cases, the lengths of the names do not really matter. Hence I
have changed the implementation to treat long names not as errors, but
as warnings.
> (7.3.5) The length of a name shall be subject to an implementation
> limit; see Annex C.
>
> (Annex C.2) Table C.1 describes the minimum architectural limits that
> should be accommodated by conforming readers running on 32-bit
> machines. Because conforming readers may be subject to these limits,
> conforming writers producing PDF files should remain within them.
>
> (Table C.1) name 127 "Maximum length of a name, in bytes."
http://adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
For named destinations that are contained in a `Dict`, as opposed to a `NameTree`, we currently iterate through the *entire* dictionary just to fetch *one* destination.
This code appears to simply have been copy-pasted from the `get destinations` method, but in its current form it's quite unnecessary/inefficient, since we can just get the required destination directly instead.
Doing this helped uncover an issue with the `getDestination` implementation.
Currently if a named destination doesn't exist, the method (in `obj.js`) may return `undefined`, which leads to the promise being stuck in a pending state.
*Note:* returning `null` for this case is consistent with other methods, e.g. `getOutline` and `getAttachments`.
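A sketch of the direct lookup, including the null fallback:

// Fetch the one destination directly rather than iterating over the
// whole /Dests dictionary; return null (not undefined) on a miss, so
// that the API promise always resolves.
function getDestination(destsDict, id) {
  var dest = destsDict.get(id);
  return dest !== undefined ? dest : null;
}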
This became obsolete in bdeca30fbf. All it does is call the Annotation constructor and add hasHtml. This patch lets the Link and Text annotations directly extend the Annotation class and add hasHtml themselves.
This patch also removes an unused global.
Recently I've landed a number of patches which fixed issues with ColorSpaces. In most of these cases the cause of the failures was, either partially or entirely, related to the fact that we didn't resolve indirect objects (i.e. the code was missing `xref.fetchIfRef(...)`).
The purpose of this patch is to fix the few remaining cases where indirect objects *could* potentially cause failures.
Given that we have seen how this causes failures in practice, I thus think that it makes sense to try and avoid further issues, instead of waiting for users to file even more bugs for this part of the code-base.
Fixes 6068.
The most notable issue with the font in question is that the `differences` array contains lots of strange entries (of the type `uniXXXX`, instead of proper glyph names).
The 'Version' field of the most recent document catalog, if present, is
intended to supersede the value in the file prologue.
This is significant for incrementally-built PDF documents and generators that
emit a low version in the prologue and later apply a format version based on
PDF features used, such as Apple's CoreGraphics/Quartz PDF backend.
Fixes the internal version variable, as well as the PDFFormatVersion reported
by the API and consumed by viewers.
For passwords where the encoding already is correct, the conversion is a no-op.
Also, since `encodeURIComponent` might throw, we need to make sure that we handle that case too.
Fixes 6010.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1050040.
With this patch the file is completely readable, but given that the font is broken enough to be rejected by OTS the rendering differs slightly from Adobe Reader.
*Note:* the PDF file is sufficiently broken that even Adobe Reader complains about the font, *and* also about another more general issue.
Currently if a font contains a `toUnicode` entry, we always create a new `ToUnicodeMap` in evaluator.js. This is done even for `IdentityV/IdentityH`, despite the possibility of using the much more compact `IdentityToUnicodeMap` representation.
This patch refactors the `IdentityH/IdentityV` cases, to:
- Avoid calling `IdentityCMap.getMap`, since this prevents allocating and iterating through an array with 65536 elements.
- Ensure that the handling of `toUnicode` is actually correct in fonts.js.
We rely on `toUnicode instanceof IdentityToUnicodeMap` in a few places, and currently this does not work correctly for `IdentityH/IdentityV`.
This is a very small follow-up to PR 5536, which sets `isStandardFont` to `false` instead of `undefined` (as currently happens for some font names).
Since the patch is so small, I hope it's OK to also fix an unrelated copy-and-paste error in a comment that was added in PR 5260.
This change does the following:
* Address TODO to remove getEmptyContainer helper.
* Do not set the container background color. The old code is incorrect,
  causing it to have no effect: it sets the color to an array (item.color)
  rather than a CSS string. Also, in most cases it would set a black
  background, which is incorrect.
* Only add border instructions when there is actually a border.
* Reduce memory consumption by not creating new three-element arrays for
  annotation colors. In fact, according to the spec this would be
  incorrect, as the default should be "transparent" for an empty array.
  Adobe Reader interprets a missing color array as black, however.
Note that only Link annotations were actually setting a border style and
color. While Text annotations might have calculated a border, they did
not color it. This behaviour is now controlled by the boolean flag.
According to practical experiments, falling back to "Helvetica" when we encounter a non-embedded "[Century Gothic](http://en.wikipedia.org/wiki/Century_Gothic)" `CIDFontType2` font seems to work well.
(Also, the section on Wikipedia about "Printer ink usage" *might* provide some anecdotal evidence that Century Gothic is a fairly standard sans-serif font.)
Obviously this patch doesn't make "Century Gothic" fonts render perfectly, as is often the case with non-embedded fonts, but all the text is now legible in the referenced issues.
Fixes 4722.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=879561.
Fixes two issues:
- #4456: The first 100 bytes are often not unique as they can be
filled with standard PDF headers - so we use the first 200 KB instead.
(This may be overkill)
- Some documents we encountered have invalid xref ids, which were
always coming out as ‘0000000000000000’ - so we detect that and use the
MD5 instead.
This is a tentative patch that adds *very* basic support for non-embedded Wingdings fonts (a Windows version of Dingbats), by falling back to the ZapfDingbats encoding. Obviously this approach will not work perfectly, but in my opinion it seems to work reasonably well in practice.
Instead of this very simple patch, another option would be to try and include more complete glyph data for Wingdings, e.g. a Unicode map and glyph widths, similar to what was done for ZapfDingbats.
However there is, in my opinion, one important difference between Wingdings and ZapfDingbats: ZapfDingbats is part of the 14 standard fonts, which in previous versions of the PDF specification was assumed to be available in PDF readers. To improve compatibility with older files, it thus makes sense for us to include data for ZapfDingbats.
However Wingdings has never been a standard font in PDF files, hence PDF files using it *should* thus contain all the necessary font data.
Given the above, I thus believe that it should be OK to fall back to ZapfDingbats for now. If non-embedded Wingdings fonts turn out to be *a lot* more common, then we can revisit this later.
Fixes 4301 completely.
Fixes 4837 almost completely. With this patch the bullets are displayed correctly, but the arrows are not of the correct type.
Fixes `artofwar.pdf`, pages 14 and 15.
As described in #5444, the evaluator will perform identity checking of
paintImageMaskXObjects to decide if it can use
paintImageMaskXObjectRepeat instead of paintImageMaskXObjectGroup.
This can only ever work if the entry is a cache hit. However, the
previous caching implementation did lazy caching, which would only
consider an image cache-worthy if it was repeated; only then would the
repeated instance be cached.
As a result of this, the sequence of identical images A1 A2 A3 A4 would
be seen as A1 A2 A2 A2 by the evaluator, which prevents using the
"repeat" optimization. Also, only the last encountered image is cached,
so A1 B1 A2 B2 would stay A1 B1 A2 B2.
The new implementation drops the "lazy" init of the cache. The threshold
for enabling an image to be cached is rather small, so the potential waste
in storage and adler32 calculation is rather low. It also caches any
eligible image by its adler32.
The two examples from above would now be A1 A1 A1 A1 and A1 B1 A1 B1,
which not only saves temporary storage, but also prevents computing
identical masks over and over again (which is the main performance
impact of #2618).
This patch makes the image from #5349 appear correctly; the artefacts
for the last packet are fixed in #5426.
This patch also optimizes some "in-checks" and adds a few header parsings.
After PR 5263, setting `disableAutoFetch = true` in the generic viewer no longer works correctly, since the entire file loads even with `disableStream = true`.
Currently when an exception is thrown, we try to reject `workerReadyCapability` with multiple arguments in src/core/api.js. This obviously doesn't work, hence this patch changes that to instead reject with the exception object as is.
In src/core/worker.js the exception is currently (unnecessarily) wrapped in an object, so this patch also simplifies that to directly send the exception object instead.
This patch avoids creating many intermediate strings, when adding dummy width/lsb entries for glyphs where those are missing.
For the relevant PDF files in our test suite, the average number of intermediate strings are well over 1000.
The scanned, black-and-white document at
https://bugzilla.mozilla.org/show_bug.cgi?id=835380 doesn't benefit from
the critical GRAYSCALE_1BPP optimization because the optimization is
skipped if `needsDecode` is set.
This change addresses that, and reduces both rendering time and memory
usage for that document by almost 10x.
setGStateForKey() is a closure that serves no particularly useful
purpose. This change inlines it at the single call site. This avoids 1.7
MiB of allocations (because closures are objects) for the MTA map
mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=835380#c17.
For the document in #2504, 11% of the ops are `setGState` with a
`gStateObj` that is an empty array, which is a no-op. This is possible
because we ignore various setGState keys (OP, OPM, BG, etc).
This change prevents these ops from being inserted into the operator
list.
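The guard itself is tiny (sketch):

// After filtering out the ignored keys, skip states that became empty.
if (gStateObj.length > 0) {
  operatorList.addOp(OPS.setGState, [gStateObj]);
}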
This allows the JS engine to do a better job of allocating the right
number of elements for the array, avoiding some resizings. For the PDF
in #2504, this avoids 100s of MiBs of allocations in Firefox.
makeInlineImage() has a "are the next five chars ASCII?" check which is
run after an "EI" sequence has been found. This check involves the
creation of a new object because peekBytes() calls subarray().
Unfortunately, the check is currently run on whitespace chars even when
an "EI" sequence has not yet been found, i.e. when it's not needed. For
the PDF in #2618, there are over 820,000 such checks.
This change reworks the relevant loop so that the check is only done
once an "EI" sequence has been seen. This reduces the number of checks
to 157,000, and speeds up rendering by somewhere between 2% and 7% (the
measurements are noisy).
The `console.log` statement in evaluator_spec.js is obviously not needed. In obj.js it could have been replaced by `info`, but that seemed unnecessary given the already existing `error`.
readCharCode() returns two values, and currently allocates a length-2
array on every call to do so. This change makes it instead use a
passed-in object which can be reused.
This tiny change reduces the total JS allocations done for the document
in Mozilla bug 992125 by 4.2%.
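The pattern, sketched with illustrative decoding:

// One reusable result object instead of a fresh length-2 array per call.
var charCodeCache = { charcode: 0, length: 0 };

function readCharCode(str, offset, out) {
  // Illustrative only; the real function handles multi-byte char codes.
  out.charcode = str.charCodeAt(offset);
  out.length = 1;
}

var pos = 0;
readCharCode('AB', pos, charCodeCache);  // fills the reused object
pos += charCodeCache.length;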
EvaluatorPreprocessor_read() is called in two cases. For the normal
layer, the args array it produces is used beyond the bounds of the loop
in which EvaluatorPreprocessor_read() is called.
But for the text layer, the args array is used in a very short-term
fashion. This change reworks things so that a single array is repeatedly
used for the text layer. This reduces total JS allocations for the
Spoorkaart map by 11%, and has similar effects on many other PDFs.
After PR 4982, the rendering of the first two pages of http://www.openmagazin.cz/pdf/2011/openMagazin-2011-04.pdf (from issue 215) no longer completes.
The issue is that we cannot have `args === null` in `PartialEvaluator_buildPath`, but *must* use an empty array instead.
In this patch I've also moved the `argsLength` variable definition in `EvaluatorPreprocessor_read`, to make sure that it's always defined.
When loading the PDF from issue #4935, this change reduces peak RSS from
~2400 to ~300 MiB, and improves overall speed by ~81%, from 6336 ms to
1222 ms.
IdentityCMap uses an array to represent a 16-bit unsigned identity
function. This is very space-inefficient, and some files cause multiple
IdentityCMaps to be instantiated (e.g. the one from #4580 has 74).
This patch makes the representation implicit.
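Conceptually (a sketch):

// The identity mapping needs no backing storage at all.
function IdentityCMap() {}
IdentityCMap.prototype.lookup = function (code) {
  // Every 16-bit code maps to itself.
  return code >= 0 && code <= 0xFFFF ? code : undefined;
};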
When loading the PDF from issue #4580, this change reduces peak RSS from
~370 to ~280 MiB. It also improves overall speed on that PDF by ~30%,
going from 522 ms to 366 ms.
cid chars are 16-bit unsigned integers. Currently we convert them to
single-char strings when inserting them into the CMap, and then convert
them back to integers when extracting them from the CMap. This patch
changes CMap so that cid chars stay in integer format throughout, saving
both time and space.
When loading the PDF from issue #4580, this change reduces peak RSS from
~600 to ~370 MiB. It also improves overall speed on that PDF by ~26%,
going from 724 ms to 533 ms.
This change avoids the element stringification caused by for..in for the
vast majority of CMaps.
When loading the PDF from issue #4580, this change reduces peak RSS from ~650
to ~600 MiB, and improves overall speed by ~20%, from 902 ms to 713 ms. Other
CMap-heavy documents will also see improvements.
This lets the JS engine resize the array elements buffer immediately,
thus avoiding some intermediate resizings. This can save multiple MiBs
of reallocation in text-heavy files.
In b5b94a4af3, i.e. PR #4259, we stopped using cidmaps.js. Despite that, it's still included when PDF.js is built. At almost 0.5 MB (and approx. 7000 lines), this is currently the single largest file in the codebase.
Including such a large file in the builds, when it is not actually used, seems extremely wasteful; hence this patch.
I have a large PDF where this function is called 1.6 million times
during loading. Minimizing the string concatenations reduces the
cumulative allocations done by Firefox within this function from 113 MB
to 48 MB.
QueueOptimizer is really hard to read. Enough so that it's blocking my
efforts to streamline the representation used for operator lists.
This patch improves its readability in the following ways.
- More descriptive variable names make the sequence checking much clearer,
as do additional comments.
- The addState() functions now return the index of the first op past the
sequence, instead of setting context.currentOperation to the last op of
the sequence.
- The loop in optimize() is clearer.
- The array modification in the fourth addState() function is much clearer
-- we're just removing trios of ops.
- All four |addState| functions are now more consistent with each other.
I used some debug printfs to find documents where these optimizations are
used and then checked that the number of optimized ops was the same before
and after my changes.
DecodeStream currently initializes its |buffer| field to |null|, which
is reasonable, because lots of DecodeStreams never need to instantiate a
buffer. But this requires various special cases in the code.
This patch changes it so that DecodeStreamClosure has a single empty
Uint8Array which is shared between all buffers upon initialization.
This avoids the special cases.
DecodeStream.prototype.ensureBuffer() is really hot, and this removes a
test from the fast path. For one 226 page scanned document this sped up
rendering by about 2%.
In src/core/obj.js, we convert a Ref to a string to index into a table like
this: 'R1.0'. This conversion is repeated numerous times.
This patch factors out the conversion into a new function,
Ref.prototype.toString().
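The new helper, sketched with a minimal Ref:

function Ref(num, gen) {
  this.num = num;
  this.gen = gen;
}
// Build the 'R1.0'-style key in one place instead of repeating the
// concatenation at every call site.
Ref.prototype.toString = function () {
  return 'R' + this.num + '.' + this.gen;
};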
This function can be called 100s of 1000s or even millions of times, and the
allocated return object accounts for 10% of all GC thing allocations for some
documents. It's easy to avoid, which reduces stress on the garbage collector,
and this patch does that.
PartialEvaluator.getTextContent() builds up textChunk strings 1 char at a time,
creating many 100s of 1000s of intermediate strings along the way. This patch
makes it instead push chars to an array and then join them at the end, as we
have done in numerous other places.
This new function is much faster than ensureRange(pos, pos+1), which is a very
common case.
This speeds up the rendering of some test cases (including the Tracemonkey
paper) by 4--5%.
* Added AES128 encryption
* Added AES256 encryption/decryption
* Added SHA256
* Added SHA512
* Added a class to handle 8-byte integers and associated bit operations
* Added SHA384
* Added routines to handle the new algorithm and perform PDF 2.0 hashing
Profiling showed that receiveAndExtend is frequently called with a length of
one bit. This happens for example in decodeBaseline.
For a single bit, the loop and shift in receive, as well as the shifts in
receiveAndExtend are overhead.
This shortcut avoids them by reading the single bit and returning 1 or -1
directly from receiveAndExtend. While it adds a small overhead for every
other length, the speedup is about 10% in the hot parse/decode path.
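A sketch of the shortcut, assuming readBit/receive helpers that read from the bit stream:

function receiveAndExtend(length) {
  if (length === 1) {
    // A set bit means +1; a cleared bit extends to -(2^1 - 1) = -1.
    return readBit() ? 1 : -1;
  }
  var n = receive(length);  // read `length` bits
  if (n >= 1 << (length - 1)) {
    return n;  // already in the positive range
  }
  return n + (-1 << length) + 1;  // extend into the negative range
}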
This change makes the 4 conversion loops look the same.
It optimizes access to the array length and to the property
numComponents, which is known to be constant.
This refactors getData to be more readable and extracts all the color
conversion algorithms to their own functions. The resulting code was then
cleaned up.
This also introduces a flag `forceRGBoutput` for getData that makes it
possible to always get the data as an RGB buffer of `width * height * 3` bytes.
The function `assertWellFormed` was doing nothing different from `assert`, which
is available in the same namespace. Removing it will lighten the file size - albeit
very slightly - and reduce complexity.
Different fonts can point to the same font descriptor
(see https://github.com/mozilla/pdf.js/issues/4339 for details). With this
commit, such fonts are treated as aliases if they also have the same encoding
and the same toUnicode map. The corresponding info is stored on the font descriptor.
This change must also ensure that aliases always use the same font name,
because translated fonts can be cleared depending on the CLEANUP_TIMEOUT setting.
This makes the code much simpler, and the extra memory use is tiny -- a vanilla
1000 element array is only 4000 bytes larger than a Uint32Array of the same
size.
This is achieved by adding getBytes2() and getBytes4() to streams, and by
changing int16() and int32() to take multiple scalar args instead of an array
arg.
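For illustration, sketches of the helpers (no intermediate arrays):

// Read two bytes and combine them without allocating an array.
Stream.prototype.getBytes2 = function () {
  var b0 = this.getByte();
  var b1 = this.getByte();
  return (b0 << 8) | b1;
};

function int16(b0, b1) {
  return (b0 << 8) + b1;
}
function int32(b0, b1, b2, b3) {
  return (b0 << 24) + (b1 << 16) + (b2 << 8) + b3;
}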
This reduces memory consumption for text-heavy documents. I tested five
documents and saw hit rates ranging from 97.4% to 99.8% (most of the misses are
due to |width| varying even when |fontChar| matches). On two of those documents
I saw improvements of 40 and 50 MiB.
The patch also introduces the Glyph constructor, and renames the |unicodeChars|
local variable as |unicode| for consistency with the corresponding Glyph
property.
There is no need to slow down the inner loop with a test for ltp, as it can
only change if prediction is true, in which case it only changes in the outer loop.
When decoding a stream, the decode buffer is often grown multiple times, its
byte size increasing like so: 512, 1024, 2048, etc. This patch estimates the
minimum size in advance (using the length of the encoded stream), often
allowing the smaller sizes to be skipped. It also renames numerous |length|
variables as |maybeLength| to make it clear that they can be |null|.
I measured this change on eight documents. This change reduces the cumulative
size of decode buffer allocations by 0--32%, with 10--20% being typical. This
reduces peak RSS by 10 or 20 MiB for several of them.
This avoids lots of unnecessary work when such streams are referred to via
fetch(), and so their bytes aren't subsequently read. This is a large
performance win on some files.
By checking if the data is all present before making a substream, we avoid
cases where we parse part of a stream and then throw a MissingDataException
part-way through, which forces us to later re-read the stream -- possibly
multiple times. This is a sizeable performance win for some cases when file
loading is slow (e.g. over the web).
Added an "InteractiveAnnotation" class to homogenize the annotations' structure (highlighting) and user interactions (for now, used for text and link annotations).
Text annotations:
The appearance (AP) has priority over the icon (Name).
The popup extends horizontally (up to a limit) as well as vertically.
Reduced the title's font size.
The annotation's color (C) is used to color the popup's background.
On top of the mouseover show/hide behavior, a click on the icon will lock the annotation open (for mobile purposes). It can be closed with another click on either the icon or the popup.
An annotation is only printed if its "print" bit is set.
Unsupported annotations are not displayed at all.
- remove unused "if (e instanceof Stream)" in XRef_fetch
- split XRef_fetch into XRef_fetchUncompressed and XRef_fetchCompressed
This explains the actual code flow better
- add line breaks after functions
- change the code to conform to the current style guide
Now, it computes the numbers with only basic arithmetic operations, without first creating a string and then calling parseFloat.
The new function doesn't behave exactly the same as the old one.
In particular, the old behaviour was that when there was a number immediately followed by an 'E', the 'E' was consumed. Now it's not. This allows for "glued" numbers and operators.
Also, the new function is faster and consumes less memory.
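A simplified sketch of arithmetic-only parsing (no exponent handling; parseNumber is a hypothetical standalone version):

function parseNumber(str) {
  var i = 0, sign = 1, value = 0, divisor = 0;
  var ch = str.charCodeAt(i++);
  if (ch === 0x2D) {  // '-'
    sign = -1;
    ch = str.charCodeAt(i++);
  }
  while (true) {
    if (ch === 0x2E && divisor === 0) {  // first '.'
      divisor = 1;
    } else if (ch >= 0x30 && ch <= 0x39) {  // '0'..'9'
      value = value * 10 + (ch - 0x30);
      divisor *= 10;  // stays 0 until the '.' is seen
    } else {
      break;  // no string buffer, no parseFloat
    }
    ch = str.charCodeAt(i++);
  }
  return sign * (divisor > 1 ? value / divisor : value);
}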
At the initialization of `Lexer_getObj` (in `parser.js`), there's a loop
that skips whitespace and breaks out whenever EOF is encountered.
(https://github.com/mozilla/pdf.js/blob/88ec2bd1a/src/core/parser.js#L586-L599)
Whenever the current character is not a whitespace character,
`ch = this.nextChar();` is used to find the next character
(which does `return (this.currentChar = this.stream.getByte());`).
The aforementioned `getByte` method retrieves the next byte using
(https://github.com/mozilla/pdf.js/blob/88ec2bd1a/src/core/stream.js#L122-L128)
var pos = this.pos;
while (this.bufferLength <= pos) {
  if (this.eof)
    return -1;
  this.readBlock();
}
return this.buffer[this.pos++];
This piece of code relies on this.eof to detect whether the last character
has been read. When the stream is a `FlateStream`, and the end of the stream
has been reached, then **`this.eof` is not set to `true`**, because this check
is done inside a loop that is not entered when the read block size is zero:
(https://github.com/mozilla/pdf.js/blob/88ec2bd1ac/src/core/stream.js#L511-L517)
for (var n = bufferLength; n < end; ++n) {
  if (typeof (b = bytes[bytesPos++]) == 'undefined') {
    this.eof = true;
    break;
  }
  buffer[n] = b;
}
This commit fixes the issue by setting this.eof to true whenever the loop is not
going to run (i.e. when bufferLength === end, i.e. blockLen === 0).
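The fix amounts to (sketch, continuing the snippet above):

if (bufferLength === end) {  // i.e. blockLen === 0
  // The copy loop never runs, so its eof check is never reached;
  // flag end-of-stream here instead.
  this.eof = true;
} else {
  for (var n = bufferLength; n < end; ++n) {
    if (typeof (b = bytes[bytesPos++]) == 'undefined') {
      this.eof = true;
      break;
    }
    buffer[n] = b;
  }
}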