[FFmpeg-trac] #6021(avcodec:new): tx3g / mov_text subtitles are not encoded correctly in some specific cases
FFmpeg
trac at avcodec.org
Sat Dec 17 08:35:09 EET 2016
#6021: tx3g / mov_text subtitles are not encoded correctly in some specific cases
---------------------------------------+-----------------------------------
Reporter: erikbs | Owner:
Type: defect | Status: new
Priority: normal | Component: avcodec
Version: git-master | Resolution:
Keywords: utf8 mov_text | Blocked By:
Blocking: | Reproduced by developer: 0
Analyzed by developer: 0 |
---------------------------------------+-----------------------------------
Comment (by erikbs):
New patch submitted for correctly decoding styles when multibyte UTF-8
characters are involved.
About UTF-16:
Given that the byte length, {{{uint64_t L}}}, of the string,
{{{char *text}}}, is known, the number of UTF-16 characters in the string
(code points, with each surrogate pair counted once) can be calculated as
follows:
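{{{
#include <stdint.h>

uint64_t utf16_char_len(const char *text, uint64_t L) {
    uint64_t l = 0, start = 0;
    uint16_t c = 0;
    uint16_t m[2] = {0xFC00, 0xDC00}; // Mask and value matching a low surrogate
    // Read the first two bytes as a Big Endian code unit to look for a BOM
    if (L >= 2) c = (uint16_t)(((uint8_t)text[0] << 8) | (uint8_t)text[1]);
    switch (c) {
    case 0xFFFE: // Little Endian, swap mask byte order
        m[0] = 0x00FC; m[1] = 0x00DC;
        // fall through
    case 0xFEFF:
        start = 2; // Skip the BOM
        // fall through
    default:
        // Count every code unit that is not a low surrogate;
        // i + 1 < L avoids reading past the end when L is odd
        for (uint64_t i = start; i + 1 < L; i += 2)
            if (((((uint8_t)text[i] << 8) |
                   (uint8_t)text[i + 1]) & m[0]) != m[1]) l++;
    }
    return l;
}
}}}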
This code expects to be fed valid UTF-16 data and assumes Big Endian when
no BOM is present.
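
To illustrate how the BOM changes the interpretation, here is a minimal self-contained sketch of the same counting rule. The helper name {{{count_utf16_code_points}}} and the sample buffers are hypothetical and not part of the patch; the sketch assumes valid UTF-16 input and Big Endian when no BOM is present, as described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): counts code points in a UTF-16
 * byte buffer by skipping low (trailing) surrogates. Assumes valid UTF-16
 * and Big Endian when no BOM is present. A trailing odd byte is ignored. */
static uint64_t count_utf16_code_points(const unsigned char *buf, uint64_t len)
{
    int little_endian = 0;
    uint64_t i = 0, count = 0;

    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            i = 2;                  /* Big Endian BOM: skip it */
        } else if (buf[0] == 0xFF && buf[1] == 0xFE) {
            i = 2;                  /* Little Endian BOM: skip it */
            little_endian = 1;
        }
    }
    for (; i + 1 < len; i += 2) {
        /* The high-order byte of a code unit tells whether it is a
         * low surrogate (0xDC00..0xDFFF). */
        unsigned char hi = little_endian ? buf[i + 1] : buf[i];
        if ((hi & 0xFC) != 0xDC)
            count++;
    }
    return count;
}

/* Sample inputs: "A" followed by U+1F600 (surrogate pair D83D DE00). */
static const unsigned char sample_be_bom[] =
    { 0xFE, 0xFF, 0x00, 0x41, 0xD8, 0x3D, 0xDE, 0x00 };
static const unsigned char sample_le_bom[] =
    { 0xFF, 0xFE, 0x41, 0x00, 0x3D, 0xD8, 0x00, 0xDE };
static const unsigned char sample_no_bom[] =
    { 0x00, 0x41, 0xD8, 0x3D, 0xDE, 0x00 };
```

All three sample buffers hold the same two characters, so each call should return 2; the BOM itself is never counted.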
The format specification only requires Big Endian support, but it demands
that the BOM be present for UTF-16. Exactly how this is supposed to be
encoded I don’t know.
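
One possible reading of that requirement is a Big Endian BOM (bytes FE FF) followed by Big Endian UTF-16 code units. The sketch below shows only that byte layout; it is an assumption, not something confirmed against the specification, and the helper names {{{put_utf16be}}} and {{{encode_utf16be_with_bom}}} are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes one code point as Big Endian UTF-16.
 * Returns the number of bytes written (2 or 4), or 0 if cp cannot be
 * encoded (a surrogate value or above U+10FFFF).
 * out must have room for 4 bytes. */
static size_t put_utf16be(uint32_t cp, unsigned char *out)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (unsigned char)(cp >> 8);
        out[1] = (unsigned char)(cp & 0xFF);
        return 2;
    }
    cp -= 0x10000;
    uint16_t hi = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    uint16_t lo = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    out[0] = (unsigned char)(hi >> 8);
    out[1] = (unsigned char)(hi & 0xFF);
    out[2] = (unsigned char)(lo >> 8);
    out[3] = (unsigned char)(lo & 0xFF);
    return 4;
}

/* Hypothetical helper: writes a Big Endian BOM followed by the code points.
 * Returns the total number of bytes written, or 0 if the output buffer is
 * too small or a code point is invalid. */
static size_t encode_utf16be_with_bom(const uint32_t *cps, size_t n,
                                      unsigned char *out, size_t cap)
{
    size_t pos = 0;

    if (cap < 2)
        return 0;
    out[pos++] = 0xFE;              /* BOM, Big Endian byte order */
    out[pos++] = 0xFF;
    for (size_t k = 0; k < n; k++) {
        unsigned char tmp[4];
        size_t w = put_utf16be(cps[k], tmp);
        if (w == 0 || cap - pos < w)
            return 0;
        for (size_t j = 0; j < w; j++)
            out[pos++] = tmp[j];
    }
    return pos;
}
```

For the code points U+0041 and U+1F600 this produces the eight bytes FE FF 00 41 D8 3D DE 00.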
--
Ticket URL: <https://trac.ffmpeg.org/ticket/6021#comment:6>
FFmpeg <https://ffmpeg.org>
FFmpeg issue tracker