In some fonts, the included `ToUnicode` data is incomplete causing text-selection to not work properly. For simple fonts that contain encoding data, we can manually build a `ToUnicode` map to attempt to improve things.
Please note that since we're currently using the `ToUnicode` data during glyph mapping, in an attempt to avoid rendering regressions, I purposely didn't want to amend to original `ToUnicode` data for this text-selection edge-case.
Instead, I opted for the current solution, which will (hopefully) give slightly better text-extraction results in PDF file with incomplete `ToUnicode` data.
According to the PDF specification, see [section 9.10.2](http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1873172):
> A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value.
> ...
Reading that paragraph literally, it doesn't seem too unreasonable to use *different* methods for different charcodes.
Fixes 8229.
We're currently disabling range requests and streaming for a number of configurations. A couple of those will no longer be supported (with PDF.js version 2.0), hence we ought to be able to clean up the compatibility code slightly.
This compatibility code is only relevant for browsers that will no longer be supported (with PDF.js version 2.0), hence we ought to be able to remove it.
*This patch is the result of me starting to look into moving parameters from `PDFJS` into `getDocument` and other API methods.*
When familiarizing myself with the code, the signatures of the various network streams seemed to be unnecessarily cumbersome since `disableRange` is currently handled separately from other parameters.
I'm assuming that the explanation for this is probably "for historical reasons", as is often the case. Hence I'd like to clean this up *before* we start the larger, and more invasive, `PDFJS` parameter re-factoring.
The interface of all of the "image" streams look kind of weird, and I'm actually a bit surprised that there hasn't been any errors because of it.
For example: None of them actually implement `readBlock` methods, and it seems more luck that anything else that we're not calling `getBytes()` (without providing a length) for those streams, since that would trigger a code-path in `getBytes` that assumes `readBlock` to exist.
To address this long-standing issue, the `ensureBuffer` methods are thus renamed to `readBlock`. Furthermore, the new `ensureBuffer` methods are now no-ops.
Finally, this patch also replaces `var` with `let` in a number of places.
In the PDF file, the `ToUnicode` data first maps the hyphen correctly, and then *overwrites* it to point to the softhyphen instead. That one cannot be rendered in browsers, and an empty space thus appear instead.
Fixes 9084.
Nothing uses this option anymore, so setting it is a no-op now. We can
safely remove it.
Use `SKIP_BABEL` (instead of `PDFJS_NEXT`) now if you want to skip Babel
translation for a build.