pdf.js/external/cmapscompress
Jonas Jenwald dcd55a7164 Enable unicorn/prefer-at unconditionally (PR 15014 follow-up)
Now that Node.js version 18 is required, we should be able to use `Array.prototype.at()` everywhere in the code-base.
2023-05-07 13:43:19 +02:00
..
compress.js Enable unicorn/prefer-at unconditionally (PR 15014 follow-up) 2023-05-07 13:43:19 +02:00
optimize.js Enable unicorn/prefer-at unconditionally (PR 15014 follow-up) 2023-05-07 13:43:19 +02:00
parse.js Enable the unicorn/prefer-string-replace-all ESLint plugin rule 2023-03-23 12:57:10 +01:00
README.md Fix typos 2018-04-01 23:20:41 +02:00

Quick notes about binary CMap format (bcmap)

The format is designed to package some information from the CMap files located at external/cmap. Please notice for size optimization reasons, the original information blocks can be changed (split or joined) and items in the blocks can be swapped.

The data stored in binary format in network byte order (big-endian).

Data primitives

The following primitives used during encoding of the file:

  • byte (B) a byte, bits are numbered from 0 (less significant) to 7 (most significant)
  • bytes block (B[n]) a sequence of n bytes
  • unsigned number (UN) the number is encoded as sequence of bytes, bit 7 is flag to continue decoding the byte, bits 6-0 store number information, e.g. bytes 0x818407 will represent 16903 (0x4207). Limited to the 32 bit.
  • signed number (SN) the number is encoded as sequence of bytes, as UN, however shall be transformed before encoding: if n < 0, the n shall be encoded as (-2n-1) using UN encoding, other n shall be encoded as (2n) using UN encoding. So the lowest bit of the number indicates the sign of the initial number
  • unsigned fixed number (UB[n]) similar to the UN, but it represents an unsigned number that is stored in B[n]
  • signed fixed number (SB[n]) similar to the SN, but it represents a signed number that is stored in B[n]
  • string (S) the string is encoded as sequence of bytes. First comes length is characters encoded as UN, when UTF16 characters encoded as UN.

File structure

The first byte is a header:

  • bits 2-1 indicate a CMapType. Valid values are 1 and 2
  • bit 0 indicate WMode. Valid values are 0 and 1.

Then records follow. The records starts from the record header encoded as B, where bits 7-5 indicate record type (see description of other bits below):

  • 0 codespacerange
  • 1 notdefrange
  • 2 cidchar
  • 3 cidrange
  • 4 bfchar
  • 5 bfrange
  • 6 reserved
  • 7 metadata

Metadata record

The metadata record header bit 4-0 contain id of the metadata:

  • 0 comment, body of the record is encoded comment string (S)
  • 1 UseCMap, body of the record is usecmap id string (S)

Data records

The records that have types 0 5, have the following fields in the header:

  • bit 4 indicate the char or start/end entries are stored in a sequence in this block
  • bits 3-0 contain length of the data size minus 1 in this block (dataSize)

The amount of entries encoded as UN follows the header. The items records follow (see below).

codespacerange (0)

Represents the following CMap block:

n begincodespacerange endcodespacerange

First record format is:

  • start as B[dataSize]
  • endDelta as UB[dataSize], end is calculated as (start + endDelta)

Next record format is:

  • startDelta as UB[dataSize], start = end + startDelta
  • endDelta as UB[dataSize], end = start + endDelta

notdefrange (1)

Represents the following CMap block:

n beginnotdefrange code endnotdefrange

First record format is:

  • start as B[dataSize]
  • endDelta as UB[dataSize], end is calculated as (start + endDelta)
  • code as UN

Next record format is:

  • startDelta as UB[dataSize], start = end + startDelta
  • endDelta as UB[dataSize], end = start + endDelta
  • code as UN

cidchar (2)

Represents the following CMap block:

n begincidchar code endcidchar

First record format is:

  • char as B[dataSize]
  • code as UN

Next record format is:

  • if sequence = 0, charDelta as UB[dataSize], char = char + charDelta + 1
  • if sequence = 1, char = char + 1
  • codeDelta as SN, code = code + codeDelta

cidrange (3)

Represents the following CMap block:

n begincidrange code endcidrange

First record format is:

  • start as B[dataSize]
  • endDelta as UN[dataSize], end is calculated as (start + endDelta)
  • code as UN

Next record format is:

  • if sequence = 0, startDelta as UB[dataSize], start = end + startDelta + 1
  • if sequence = 1, start = end + 1
  • endDelta as UN[dataSize], end = start + endDelta
  • code as UN

bfchar (4)

Represents the following CMap block:

n beginbfchar endbfchar

First record format is:

  • char as B[ucs2Size], where ucs2Size = 2 (here and below)
  • code as B[dataSize]

Next record format is:

  • if sequence = 0, charDelta as UN[ucs2Size], char = charDelta + charDelta + 1
  • if sequence = 1, char = char + 1
  • codeDelta as SB[dataSize], code = code + codeDelta

bfrange (5)

Represents the following CMap block:

n beginbfrange endbfrange

First record format is:

  • start as B[ucs2Size]
  • endDelta as UB[ucs2Size], end is calculated as (start + endDelta)
  • code as B[dataSize]

Next record format is:

  • if sequence = 0, startDelta as UB[ucs2Size], start = end + startDelta + 1
  • if sequence = 1, start = end + 1
  • endDelta as UB[ucs2Size], end = start + endDelta
  • code as B[dataSize]