feat(string): add UTF-8 string conversion and validation functions #2528
bobtista wants to merge 11 commits into
| Filename | Overview |
|---|---|
| Core/Libraries/Source/WWVegas/WWLib/utf8.h | New header declaring UTF-8/UTF-16LE conversion functions; correctly uses #pragma once, clear documentation, and Windows-only note. |
| Core/Libraries/Source/WWVegas/WWLib/utf8.cpp | New Win32-backed conversion implementation; prior review threads raised overlong encoding and surrogate rejection gaps in a Utf8_Validate stub that was removed from this iteration. Remaining conversion functions are straightforward wrappers. |
| Core/GameEngine/Source/Common/System/AsciiString.cpp | translate() replaced with proper UTF-8 conversion; also includes an unrelated fix moving null-terminator write outside the strToCopy guard in ensureUniqueBufferOfSize, which is a real bugfix but changes pre-existing behavior. |
| Core/GameEngine/Source/Common/System/UnicodeString.cpp | Mirror of AsciiString.cpp changes; translate() and ensureUniqueBufferOfSize null-terminator fix applied consistently. |
| Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp | Replaces raw Win32 calls with WWLib wrappers; eliminates separate heap allocations and correctly checks conversion return values. Empty-string early return is a behaviour-identical replacement. |
| Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt | Adds utf8.cpp/h to unconditional WWLIB_SRC list; prior thread noted this should be inside the if(WIN32) block, but dev replied the #error placeholder is intentional for now. |
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Caller
participant translate
participant Utf_Len as Utf16Le_To_Utf8_Len / Utf8_To_Utf16Le_Len
participant Win32 as WideCharToMultiByte / MultiByteToWideChar
participant Utf_Conv as Utf16Le_To_Utf8 / Utf8_To_Utf16Le
participant Buffer as AsciiString / UnicodeString Buffer
Caller->>translate: translate(src)
translate->>Utf_Len: query required output size (srcLen chars)
Utf_Len->>Win32: "Win32 query call (destLen=0)"
Win32-->>Utf_Len: byte/wchar count (or 0 on fail/empty)
Utf_Len-->>translate: len
alt "len == 0 (empty or failure)"
translate->>Buffer: clear()
translate-->>Caller: return
else "len > 0"
translate->>Buffer: ensureUniqueBufferOfSize(len+1)
translate->>Utf_Conv: "convert to buffer (destLen=len+1)"
Utf_Conv->>Win32: Win32 convert call
Win32-->>Utf_Conv: written chars (or 0 on fail)
Utf_Conv-->>translate: "written (0 = failure)"
alt "written == 0 (failure)"
translate->>Buffer: clear()
end
translate->>Buffer: validate()
translate-->>Caller: return
end
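The query-then-convert flow in the diagram can be sketched portably as below. This is a minimal illustration, not the PR's code: `Encode_Utf8` and `Translate_To_Utf8` are hypothetical names, the sketch uses `char16_t` and `std::string` instead of `wchar_t` and the engine string classes, and it treats an unpaired surrogate as a failure, which is an assumption made here to exercise the failure branch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the length query and the conversion call.
// With dest == nullptr it only counts the UTF-8 bytes required.
// Returns 0 for empty input, an unpaired surrogate, or a too-small buffer.
static size_t Encode_Utf8(char* dest, size_t destLen, const char16_t* src, size_t srcLen)
{
    static const unsigned char lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };
    size_t out = 0;
    for (size_t i = 0; i < srcLen; ++i)
    {
        char32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF)
        {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= srcLen || src[i + 1] < 0xDC00 || src[i + 1] > 0xDFFF)
                return 0;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF)
        {
            return 0; // stray low surrogate
        }
        const size_t n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (dest != nullptr)
        {
            if (out + n > destLen)
                return 0;
            if (n == 1)
            {
                dest[out] = char(cp);
            }
            else
            {
                // Lead byte carries the top bits, continuation bytes 6 bits each.
                dest[out] = char(lead[n] | (cp >> (6 * (n - 1))));
                for (size_t k = 1; k < n; ++k)
                    dest[out + k] = char(0x80 | ((cp >> (6 * (n - 1 - k))) & 0x3F));
            }
        }
        out += n;
    }
    return out;
}

// Mirrors the diagram: query the size, clear on 0, otherwise size the
// buffer once, convert, and clear again if the conversion reports failure.
std::string Translate_To_Utf8(const std::u16string& src)
{
    std::string out;
    const size_t len = Encode_Utf8(nullptr, 0, src.data(), src.size());
    if (len == 0)
        return out; // empty or failure
    out.resize(len);
    const size_t written = Encode_Utf8(&out[0], out.size(), src.data(), src.size());
    if (written == 0)
        out.clear();
    return out;
}
```

Sizing the destination once from the query result is what lets the caller avoid a second temporary allocation.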
Reviews (14): Last reviewed commit: "fix(string): return empty string when Th..."
xezon left a comment
Get_Utf8_Size should not include the null terminator in its size.
| if (dest_size == 0) | ||
| return; | ||
| return false; | ||
| int result = MultiByteToWideChar(CP_UTF8, 0, src, -1, dest, (int)dest_size); |
What happens if dest_size does not have enough room for a null terminator?
The doc says "Does not write a null terminator" - should we add more comments? Change the functions to always null-terminate? What do you want here?
Maybe make it behave like strncpy? Writes null if there is room, otherwise not.
The only issue with the current interface then is that we will not know whether it wrote the null terminator. Maybe it should return size_t instead, returning the number of characters it writes or would like to write? MultiByteToWideChar also does that.
I suggest thinking this through and designing the function interface so that it can conveniently be used for fixed size strings (std::string, AsciiString) and large throwaway buffers (char arr[512]).
The behavior definitely needs to be documented.
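One way to read the suggested contract is sketched below. `Copy_Bounded` is a hypothetical helper that uses a plain byte copy as a stand-in for the real conversion, so it only illustrates the return-value convention: the function returns the number of units required, writes them when they fit, and appends a terminator only when a spare slot remains.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a conversion function with a strncpy-like contract.
// Return value <= destLen: all units were written; a null terminator was
//                          appended only if the return value is < destLen.
// Return value >  destLen: nothing was written; the value is the size to allocate.
size_t Copy_Bounded(char* dest, size_t destLen, const char* src, size_t srcLen)
{
    if (srcLen > destLen)
        return srcLen; // too small: report the required size
    if (srcLen > 0)
        std::memcpy(dest, src, srcLen);
    if (srcLen < destLen)
        dest[srcLen] = '\0'; // room left: terminate
    return srcLen;
}
```

With this shape a caller holding a `char arr[512]` can compare the return value with the buffer size to learn both whether the write happened and whether the result is terminated, while a caller with a resizable string can use the same value to allocate and retry.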
The diff now shows unrelated changes.
39d7229 to 40393b8
Try again: I cleaned up the commits and force-pushed.
| } | ||
| ensureUniqueBufferOfSize((Int)size + 1, false, nullptr, nullptr); | ||
| char* buf = peek(); | ||
| if (!Unicode_To_Utf8(buf, src, srcLen, size)) |
So is this translating UTF16LE from windows into UTF8 that is then stored within AsciiString?
If so, this may help with the paths issue with usernames and paths not using Latin characters, but file handling functions will need updating to use Unicode variants instead of ASCII.
I wonder if we should also add a flag to state that the AsciiString is holding a UTF-8 string? I guess all normal ASCII characters will display properly; it's just extended character sets that will look garbled.
…ming, null-terminate when room
Slight nitpick on function naming: shouldn't it be Utf16 rather than Wchar? Wchar is only UTF-16 on Windows, yet we will likely want the conversion functions on other platforms as well, for at least CSF parsing if nothing else.
Yeah, but it's consistent with the other naming as it is for now. How about we keep the naming and use a uint16_t/char16_t type internally rather than wchar_t when we make non-Windows paths? Or would you rather we rename to something like Utf16_To_Utf8?
If anything it would be Utf16Le_To_Utf8, since Windows uses the little-endian UTF-16 format. Not sure if Utf16Be is used much anywhere, but it is worth being precise about it.
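To make the Le suffix concrete: `wchar_t` is 16 bits wide on Windows but commonly 32 bits on Linux and macOS, whereas `char16_t` holds one UTF-16 code unit everywhere. The sketch below uses a hypothetical helper, `To_Utf16Le_Bytes`, which is not part of the PR, to show the little-endian byte layout written out explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Serializes UTF-16 code units as little-endian bytes: low byte first.
// Hypothetical helper for illustration only.
std::string To_Utf16Le_Bytes(const std::u16string& units)
{
    std::string bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t u : units)
    {
        bytes.push_back(char(u & 0xFF));        // low byte
        bytes.push_back(char((u >> 8) & 0xFF)); // high byte
    }
    return bytes;
}
```

For example, U+20AC serializes as the bytes AC 20 in this order; a big-endian variant would emit 20 AC.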
| return required; | ||
| } | ||
| } | ||
| WWASSERT(destLen >= Utf16Le_To_Utf8_Len(src, srcLen)); |
Can we perhaps assert on written instead of making another call to WideCharToMultiByte?
| @@ -109,24 +112,9 @@ size_t Utf8_To_Utf16Le_Len(const char* src, size_t srcLen) | |||
| return (wchars > 0) ? (size_t)wchars : 0; | |||
Maybe do >=, so the branch predictor is 100% correct.
| @@ -109,24 +112,9 @@ size_t Utf8_To_Utf16Le_Len(const char* src, size_t srcLen) | |||
| } | ||
| WWASSERT(destLen >= Utf16Le_To_Utf8_Len(src, srcLen)); | ||
| const int written = WideCharToMultiByte(CP_UTF8, 0, src, (int)srcLen, dest, (int)destLen, nullptr, nullptr); | ||
| if (written <= 0) |
Is this now a contradiction to the assert? Would this only be true if the assert was failing?
Yeah this branch is dead code when the assert holds, but WWASSERT compiles out in release, so do we keep it?
You can replace this entire branch with written = max(0, written).
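A minimal sketch of that clamp, assuming the surrounding function returns `size_t` as discussed earlier in the thread. `Clamp_Written` is a hypothetical name used only to isolate the expression.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Clamp the int result of a conversion call to a non-negative value before
// converting it to size_t, so no separate error branch is needed.
size_t Clamp_Written(int written)
{
    return static_cast<size_t>(std::max(0, written));
}
```

WideCharToMultiByte and MultiByteToWideChar report failure by returning 0, so the clamp mainly guards the signed-to-unsigned cast while leaving 0 as the failure value.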
Could the unconditional replacement of translate() silently break callers passing legacy CP1252 data, causing strings that are valid CP1252 but invalid UTF-8 to be corrupted or cleared? Maybe something like:
Maybe put a breakpoint and check usage patterns.
What is the current state of this? Can it be finalized?
Yeah, I just pushed cleanups addressing the open threads. It should be ready for another pass.
Adds UTF-8 string handling to WWLib and plumbs it through the codebase, replacing the GameSpy-specific Win32 wrappers with a shared implementation.
Picks up the work proposed in #2045 by @slurmlord, with API adjustments per the review from @xezon.

New: WWLib/utf8.h / utf8.cpp
- Utf8_Num_Bytes(char lead): byte count of a UTF-8 character from its lead byte
- Utf8_Trailing_Invalid_Bytes(const char* str, size_t length): count of invalid trailing bytes due to an incomplete multi-byte sequence
- Utf8_Validate(const char* str) / Utf8_Validate(const char* str, size_t length): returns true if the string is valid UTF-8 per RFC 3629 (rejects overlong encodings and codepoints above U+10FFFF)
- Utf16Le_To_Utf8_Len(const wchar_t* src, size_t srcLen) / Utf8_To_Utf16Le_Len(const char* src, size_t srcLen): required output size, not counting null terminator
- Utf16Le_To_Utf8(char* dest, size_t destLen, const wchar_t* src, size_t srcLen)
- Utf8_To_Utf16Le(wchar_t* dest, size_t destLen, const char* src, size_t srcLen)

Naming follows the Snake_Case convention used in WWVegas. The conversion functions return the number of units required: if the return is <= destLen, the conversion was written (with a null terminator if room remains); if > destLen, the buffer was too small and the return value tells the caller how much to allocate; 0 indicates a conversion failure. Implementation is Windows-only and treats wchar_t as UTF-16LE, wrapping Win32 WideCharToMultiByte / MultiByteToWideChar.

AsciiString::translate / UnicodeString::translate
Replaces the broken implementations that only worked for 7-bit ASCII (marked @todo since the original code) with proper UTF-8 conversion using the new WWLib functions.

ThreadUtils.cpp
Replaces raw Win32 API calls in MultiByteToWideCharSingleLine and WideCharStringToMultiByte with the new WWLib functions, using std::string::resize / std::wstring::resize to avoid duplicate allocation.
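For readers unfamiliar with the RFC 3629 rules mentioned above, here is a minimal validation sketch. It is not the PR's implementation: `Lead_Byte_Length` and `Is_Valid_Utf8` are hypothetical names, and the sketch also rejects encoded surrogates (U+D800 to U+DFFF), which RFC 3629 prohibits.

```cpp
#include <cassert>
#include <cstddef>

// Sequence length implied by a lead byte, or 0 if the byte cannot start one.
size_t Lead_Byte_Length(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2; // C0 and C1 would be overlong
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4; // F5 and up exceed U+10FFFF
    return 0;
}

bool Is_Valid_Utf8(const char* str, size_t length)
{
    const unsigned char* s = reinterpret_cast<const unsigned char*>(str);
    size_t i = 0;
    while (i < length)
    {
        const size_t n = Lead_Byte_Length(s[i]);
        if (n == 0 || n > length - i)
            return false; // bad lead byte or truncated sequence
        // The second byte has a narrower range for four specific lead bytes.
        unsigned char lo = 0x80;
        unsigned char hi = 0xBF;
        if (s[i] == 0xE0) lo = 0xA0;      // excludes overlong 3-byte forms
        else if (s[i] == 0xED) hi = 0x9F; // excludes surrogates
        else if (s[i] == 0xF0) lo = 0x90; // excludes overlong 4-byte forms
        else if (s[i] == 0xF4) hi = 0x8F; // excludes values above U+10FFFF
        for (size_t k = 1; k < n; ++k)
        {
            const unsigned char c = s[i + k];
            const unsigned char minByte = (k == 1) ? lo : 0x80;
            const unsigned char maxByte = (k == 1) ? hi : 0xBF;
            if (c < minByte || c > maxByte)
                return false;
        }
        i += n;
    }
    return true;
}
```

A lone byte 0xE9, which is "é" in CP1252, fails this check because it announces a three-byte sequence that never arrives.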