Only decode text direction entities in Sub files

Previously, all entities were decoded in Subtitle files because SubtitleEdit's /ReverseRtlStartEnd option is not entity-aware.

It actually ends up reversing the `;` of `&rlm;`, instead of the actual value of `&rlm;`. Therefore, I decoded all entities before SubtitleEdit could process the Subtitle. However, this caused problems with more advanced formats like TTML and WebVTT, where `&lt;` (among other problematic characters) would decode to `<` and cause syntax errors.
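
To illustrate the difference, here is a minimal sketch comparing the two approaches on a made-up cue string. The sample text and variable names are assumptions for illustration only and are not part of the Devine code:

```python
import html

# Hypothetical sample cue, not taken from any real subtitle file.
cue = "&rlm;Tom &amp; Jerry &lt;3"

# Old behaviour: every entity is decoded, so markup-significant
# characters such as `<` and `&` end up in the subtitle text.
decoded_all = html.unescape(cue)

# New behaviour: only the two text direction marks are decoded,
# every other entity is left encoded.
decoded_marks = (
    cue
    .replace("&lrm;", html.unescape("&lrm;"))
    .replace("&rlm;", html.unescape("&rlm;"))
)

print(repr(decoded_all))
print(repr(decoded_marks))
```

With the targeted replacement, the right-to-left mark becomes a real U+200F character that SubtitleEdit can move as a single unit, while `&amp;` and `&lt;` stay encoded.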

According to the TTML and WebVTT specs, HTML entity encoding is allowed, which makes sense, as otherwise you wouldn't be able to use `<` etc. Any failure of a player to show the decoded character would be a player problem and out of scope for Devine.
rlaphoenix 2024-02-05 12:37:21 +00:00
parent 568cb616df
commit 167b45475e
1 changed files with 5 additions and 1 deletions


@@ -316,7 +316,11 @@ class HLS:
 if isinstance(track, Subtitle):
     segment_data = try_ensure_utf8(segment_data)
     if track.codec not in (Subtitle.Codec.fVTT, Subtitle.Codec.fTTML):
-        segment_data = html.unescape(segment_data.decode("utf8")).encode("utf8")
+        # decode text direction entities or SubtitleEdit's /ReverseRtlStartEnd won't work
+        segment_data = segment_data.decode("utf8"). \
+            replace("&lrm;", html.unescape("&lrm;")). \
+            replace("&rlm;", html.unescape("&rlm;")). \
+            encode("utf8")
 f.write(segment_data)
 segment_file.unlink()