feat(string): add UTF-8 string conversion and validation functions by bobtista · Pull Request #2528 · TheSuperHackers/GeneralsGameCode

bobtista · 2026-04-03T00:07:30Z

Relates to refactor(string): Add functions for handling UTF8 encoded strings #2045

Adds UTF-8 string handling to WWLib and plumbs it through the codebase, replacing the GameSpy-specific Win32 wrappers with a shared implementation.

Picks up the work proposed in #2045 by @slurmlord, with API adjustments per the review from @xezon.

New: `WWLib/utf8.h` / `utf8.cpp`

- `Utf8_Num_Bytes(char lead)`: byte count of a UTF-8 character from its lead byte
- `Utf8_Trailing_Invalid_Bytes(const char* str, size_t length)`: count of invalid trailing bytes due to an incomplete multi-byte sequence
- `Utf8_Validate(const char* str)` / `Utf8_Validate(const char* str, size_t length)`: returns true if the string is valid UTF-8 per RFC 3629 (rejects overlong encodings and codepoints above U+10FFFF)
- `Utf16Le_To_Utf8_Len(const wchar_t* src, size_t srcLen)` / `Utf8_To_Utf16Le_Len(const char* src, size_t srcLen)`: required output size, not counting the null terminator
- `Utf16Le_To_Utf8(char* dest, size_t destLen, const wchar_t* src, size_t srcLen)`
- `Utf8_To_Utf16Le(wchar_t* dest, size_t destLen, const char* src, size_t srcLen)`
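For readers unfamiliar with lead-byte classification, here is a minimal sketch of what a function like `Utf8_Num_Bytes` could look like. The name `Sketch_Utf8_Num_Bytes` and the choice to return 0 for bytes that cannot start a sequence are assumptions for illustration; the PR's actual implementation may behave differently.

```cpp
#include <cstddef>

// Hypothetical sketch, not the PR's code.
// Returns the sequence length implied by a UTF-8 lead byte, or 0 if the
// byte cannot start a well-formed sequence under RFC 3629.
inline size_t Sketch_Utf8_Num_Bytes(char lead)
{
	const unsigned char c = static_cast<unsigned char>(lead);
	if (c < 0x80) { return 1; }              // 0xxxxxxx: ASCII
	if (c >= 0xC2 && c <= 0xDF) { return 2; } // 110xxxxx (0xC0 and 0xC1 would be overlong)
	if (c >= 0xE0 && c <= 0xEF) { return 3; } // 1110xxxx
	if (c >= 0xF0 && c <= 0xF4) { return 4; } // 11110xxx, capped so the result stays <= U+10FFFF
	return 0;                                // continuation byte (10xxxxxx) or invalid lead
}
```

A lead byte alone does not prove the sequence is valid; the continuation bytes still need to be checked separately.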

Naming follows the Snake_Case convention used in WWVegas. The conversion functions return the number of units required: if the return is <= destLen the conversion was written (with a null terminator if room remains); if > destLen the buffer was too small and the return value tells the caller how much to allocate; 0 indicates a conversion failure. Implementation is Windows-only and treats wchar_t as UTF-16LE, wrapping Win32 WideCharToMultiByte / MultiByteToWideChar.
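The return-value contract described above can be illustrated with a portable stand-in. This is a sketch under stated assumptions, not the PR's Win32-backed code: `Sketch_Utf16_To_Utf8` and `Sketch_To_Utf8` are hypothetical names, `char16_t` replaces `wchar_t`, and treating unpaired surrogates as a failure is a choice made here for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in mirroring the described contract:
//   return == 0        -> conversion failure (or empty input)
//   return <= destLen  -> output written, null-terminated only if room remains
//   return >  destLen  -> nothing written; the value is the size to allocate
size_t Sketch_Utf16_To_Utf8(char* dest, size_t destLen, const char16_t* src, size_t srcLen)
{
	// Pass 1: count output bytes and reject unpaired surrogates (assumption).
	size_t required = 0;
	for (size_t i = 0; i < srcLen; ++i)
	{
		const char16_t u = src[i];
		if (u < 0x80) { required += 1; }
		else if (u < 0x800) { required += 2; }
		else if (u >= 0xD800 && u <= 0xDBFF)
		{
			const bool paired = (i + 1 < srcLen) && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF;
			if (!paired) { return 0; }
			required += 4;
			++i; // skip the low surrogate
		}
		else if (u >= 0xDC00 && u <= 0xDFFF) { return 0; }
		else { required += 3; }
	}
	if (required > destLen) { return required; } // too small; also serves as the size query

	// Pass 2: encode into dest.
	size_t out = 0;
	for (size_t i = 0; i < srcLen; ++i)
	{
		unsigned long cp = src[i];
		if (cp >= 0xD800 && cp <= 0xDBFF)
		{
			cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
			++i;
		}
		if (cp < 0x80) { dest[out++] = static_cast<char>(cp); }
		else if (cp < 0x800)
		{
			dest[out++] = static_cast<char>(0xC0 | (cp >> 6));
			dest[out++] = static_cast<char>(0x80 | (cp & 0x3F));
		}
		else if (cp < 0x10000)
		{
			dest[out++] = static_cast<char>(0xE0 | (cp >> 12));
			dest[out++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
			dest[out++] = static_cast<char>(0x80 | (cp & 0x3F));
		}
		else
		{
			dest[out++] = static_cast<char>(0xF0 | (cp >> 18));
			dest[out++] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
			dest[out++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
			dest[out++] = static_cast<char>(0x80 | (cp & 0x3F));
		}
	}
	if (out < destLen) { dest[out] = '\0'; } // terminate only when room remains
	return out;
}

// Caller pattern: query the size, size the string once, then convert in place.
std::string Sketch_To_Utf8(const std::u16string& src)
{
	const size_t len = Sketch_Utf16_To_Utf8(nullptr, 0, src.data(), src.size());
	std::string result(len, '\0');
	if (len != 0 && Sketch_Utf16_To_Utf8(&result[0], result.size(), src.data(), src.size()) != len)
	{
		result.clear(); // treat a mismatch as failure
	}
	return result;
}
```

With this shape, a caller using a fixed throwaway buffer can compare the return value against the buffer size to detect truncation, while a caller using a resizable string can allocate exactly once.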

`AsciiString::translate` / `UnicodeString::translate`

Replaces the broken implementations that only worked for 7-bit ASCII (marked @todo since the original code) with proper UTF-8 conversion using the new WWLib functions.

`ThreadUtils.cpp`

Replaces raw Win32 API calls in MultiByteToWideCharSingleLine and WideCharStringToMultiByte with the new WWLib functions, using std::string::resize / std::wstring::resize to avoid duplicate allocation.

greptile-apps · 2026-04-03T00:12:52Z

Greptile Summary

This PR introduces WWLib/utf8.h and utf8.cpp with Win32-backed UTF-16LE ↔ UTF-8 conversion utilities, and replaces the long-standing ASCII-only translate() stubs in AsciiString / UnicodeString with proper UTF-8 conversions using the new functions. ThreadUtils.cpp's raw Win32 calls are also refactored to use the shared helpers, eliminating manual heap allocations.

- utf8.cpp wraps WideCharToMultiByte / MultiByteToWideChar with a clear size-query/convert two-step API; the #ifdef _WIN32 … #else #error #endif pattern intentionally gates non-Windows builds.
- AsciiString::translate and UnicodeString::translate now correctly handle multi-byte Unicode content, and an unrelated null-terminator placement fix in ensureUniqueBufferOfSize (moved outside the strToCopy guard) is included.
- MultiByteToWideCharSingleLine and WideCharStringToMultiByte in ThreadUtils.cpp now check the conversion return value and avoid the previous separate heap allocation, addressing the concern raised in a prior review thread.

Confidence Score: 5/5

The conversion functions are straightforward Win32 wrappers with no new memory management risk, and the translate() rewrites correctly handle empty strings and conversion failures by clearing the target.

All changed paths — translate(), ensureUniqueBufferOfSize(), MultiByteToWideCharSingleLine(), WideCharStringToMultiByte() — have been reviewed against their callers. The null-terminator placement fix is correct and the conversion return-value checks resolve the issue flagged in a prior thread. No functional regressions or new bugs were found beyond a minor readability nit on the >= 0 guard.

No files require special attention; the only outstanding remark is a cosmetic guard condition in utf8.cpp.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Core/Libraries/Source/WWVegas/WWLib/utf8.h | New header declaring UTF-8/UTF-16LE conversion functions; correctly uses #pragma once, clear documentation, and Windows-only note. |
| Core/Libraries/Source/WWVegas/WWLib/utf8.cpp | New Win32-backed conversion implementation; prior review threads raised overlong encoding and surrogate rejection gaps in a Utf8_Validate stub that was removed from this iteration. Remaining conversion functions are straightforward wrappers. |
| Core/GameEngine/Source/Common/System/AsciiString.cpp | translate() replaced with proper UTF-8 conversion; also includes an unrelated fix moving null-terminator write outside the strToCopy guard in ensureUniqueBufferOfSize, which is a real bugfix but changes pre-existing behavior. |
| Core/GameEngine/Source/Common/System/UnicodeString.cpp | Mirror of AsciiString.cpp changes; translate() and ensureUniqueBufferOfSize null-terminator fix applied consistently. |
| Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp | Replaces raw Win32 calls with WWLib wrappers; eliminates separate heap allocations and correctly checks conversion return values. Empty-string early return is a behaviour-identical replacement. |
| Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt | Adds utf8.cpp/h to unconditional WWLIB_SRC list; prior thread noted this should be inside the if(WIN32) block, but dev replied the #error placeholder is intentional for now. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller
    participant translate
    participant Utf_Len as Utf16Le_To_Utf8_Len / Utf8_To_Utf16Le_Len
    participant Win32 as WideCharToMultiByte / MultiByteToWideChar
    participant Utf_Conv as Utf16Le_To_Utf8 / Utf8_To_Utf16Le
    participant Buffer as AsciiString / UnicodeString Buffer

    Caller->>translate: translate(src)
    translate->>Utf_Len: query required output size (srcLen chars)
    Utf_Len->>Win32: "Win32 query call (destLen=0)"
    Win32-->>Utf_Len: byte/wchar count (or 0 on fail/empty)
    Utf_Len-->>translate: len
    alt "len == 0 (empty or failure)"
        translate->>Buffer: clear()
        translate-->>Caller: return
    else "len > 0"
        translate->>Buffer: ensureUniqueBufferOfSize(len+1)
        translate->>Utf_Conv: "convert to buffer (destLen=len+1)"
        Utf_Conv->>Win32: Win32 convert call
        Win32-->>Utf_Conv: written chars (or 0 on fail)
        Utf_Conv-->>translate: "written (0 = failure)"
        alt "written == 0 (failure)"
            translate->>Buffer: clear()
        end
        translate->>Buffer: validate()
        translate-->>Caller: return
    end

Reviews (14): Last reviewed commit: "fix(string): return empty string when Th..."

bobtista · 2026-04-03T02:32:50Z

Fixed the `if` formatting.
Added RFC 3629 overlong and out-of-range checks.
Re the theoretical memory leak: can that even happen here? set() allocates via the engine's custom memory allocator, which crashes on failure rather than throwing, so the leak path can't really be reached, right?

xezon

Get_Utf8_Size should not include the null terminator in its size.

xezon · 2026-04-05T12:17:27Z

 	if (dest_size == 0)
-		return;
+		return false;
 	int result = MultiByteToWideChar(CP_UTF8, 0, src, -1, dest, (int)dest_size);


What happens if dest_size does not have enough room for a null terminator?

The doc says "Does not write a null terminator" - should we add more comments? Change the functions to always null-terminate? What do you want here?

Maybe make it behave like strncpy? Writes null if there is room, otherwise not.

The only issue with the current interface then is that we will not know if it wrote the null terminator. Maybe it should return size_t instead, returning the number of characters it writes or would like to write? MultiByteToWideChar also does that.

I suggest thinking this through and designing the function interface so that it can conveniently be used for fixed size strings (std::string, AsciiString) and large throwaway buffers (char arr[512]).

The behavior definitely needs to be documented.

xezon · 2026-04-06T19:57:25Z

The diff now shows unrelated changes.

bobtista · 2026-04-06T20:04:14Z

> The diff now shows unrelated changes.

Try again: I cleaned up the commits and force-pushed.

Mauller · 2026-04-07T07:35:06Z

+	}
+	ensureUniqueBufferOfSize((Int)size + 1, false, nullptr, nullptr);
+	char* buf = peek();
+	if (!Unicode_To_Utf8(buf, src, srcLen, size))


So is this translating UTF16LE from windows into UTF8 that is then stored within AsciiString?

If so this may help with the paths issue with usernames and paths not using Latin characters, but file handling functions will need updating to use unicode variants instead of Ascii.

Mauller · 2026-04-07T07:47:21Z

I wonder if we should also add a flag to state that the Ascii string is holding a UTF8 string?

I guess all normal ascii characters will display properly, it's just extended character sets that will look garbled.

OmniBlade · 2026-04-08T08:56:40Z

Slight nitpick on function naming: shouldn't it be Utf16 rather than Wchar? Wchar is only UTF-16 on Windows, yet we will likely want the conversion functions on other platforms as well, at least for CSF parsing if nothing else.

bobtista · 2026-04-08T14:20:32Z

> Slight nitpick on function naming: shouldn't it be Utf16 rather than Wchar? Wchar is only UTF-16 on Windows, yet we will likely want the conversion functions on other platforms as well, at least for CSF parsing if nothing else.

Yeah, but it's consistent with the other naming as it is for now. How about we keep the naming and use a uint16_t/char16_t type internally rather than wchar_t when we make non-Windows paths? Or would you rather we rename to something like Utf16_To_Utf8?

Mauller · 2026-04-13T09:33:52Z

> > Slight nitpick on function naming: shouldn't it be Utf16 rather than Wchar? Wchar is only UTF-16 on Windows, yet we will likely want the conversion functions on other platforms as well, at least for CSF parsing if nothing else.
>
> Yeah, but it's consistent with the other naming as it is for now. How about we keep the naming and use a uint16_t/char16_t type internally rather than wchar_t when we make non-Windows paths? Or would you rather we rename to something like Utf16_To_Utf8?

If anything it would be Utf16Le_To_Utf8, since Windows uses the little-endian UTF-16 format. Not sure if Utf16Be is used much anywhere, but it's worth being precise about it.

xezon · 2026-04-18T18:14:43Z

-			return required;
-		}
-	}
+	WWASSERT(destLen >= Utf16Le_To_Utf8_Len(src, srcLen));


Can we perhaps use written to assert with, instead of another call to WideCharToMultiByte ?

xezon · 2026-04-18T18:16:21Z

@@ -109,24 +112,9 @@ size_t Utf8_To_Utf16Le_Len(const char* src, size_t srcLen)
 	return (wchars > 0) ? (size_t)wchars : 0;


Maybe do >=, so the branch predictor is 100% correct.

Or max(0, wchars)

xezon · 2026-04-18T18:16:33Z

Minor: const

xezon · 2026-04-18T18:21:30Z

-	}
+	WWASSERT(destLen >= Utf16Le_To_Utf8_Len(src, srcLen));
 	const int written = WideCharToMultiByte(CP_UTF8, 0, src, (int)srcLen, dest, (int)destLen, nullptr, nullptr);
 	if (written <= 0)


Is this now a contradiction to the assert? Would this only be true if the assert was failing?

Yeah this branch is dead code when the assert holds, but WWASSERT compiles out in release, so do we keep it?

You can replace this entire branch and substitute it with written = max(0, written)
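A minimal sketch of the clamp suggested above, assuming the Win32 call's `int` result is the only input. `Sketch_Clamp_Written` is a hypothetical helper name, not something from the PR.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: maps an int result (0 on failure, never meaningfully
// negative) to size_t without a separate failure branch.
inline size_t Sketch_Clamp_Written(int written)
{
	return static_cast<size_t>(std::max(0, written));
}
```

One practical caveat: `<windows.h>` defines `min` and `max` as macros unless `NOMINMAX` is defined before including it, which can interfere with `std::max` in a Windows-only translation unit.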

githubawn · 2026-04-21T23:27:36Z

Could the unconditional replacement of translate() silently break callers passing legacy CP1252 data, causing strings that are valid CP1252 but invalid UTF-8 to corrupt or clear?

Maybe something like:
Check if the source is valid UTF-8 via Utf8_Validate.
If valid: Proceed with UTF-8 conversion.
If invalid: Fall back to the legacy 1:1 byte-to-wide cast (treating it as CP1252).
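The validate-then-fall-back idea proposed here could be sketched as follows. All names (`Sketch_Utf8_Validate`, `Sketch_Pick_Encoding`, `Sketch_Legacy_Widen`) are hypothetical, `char16_t` stands in for `wchar_t`, and the validator is a from-scratch RFC 3629 check written for this illustration rather than the PR's `Utf8_Validate`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical RFC 3629 check: rejects overlong forms, surrogates,
// codepoints above U+10FFFF, stray continuation bytes and truncated tails.
bool Sketch_Utf8_Validate(const char* str, size_t length)
{
	const unsigned char* s = reinterpret_cast<const unsigned char*>(str);
	size_t i = 0;
	while (i < length)
	{
		const unsigned char c = s[i];
		if (c < 0x80) { ++i; continue; }

		size_t n = 0;
		unsigned long cp = 0;
		unsigned long min = 0; // smallest codepoint allowed for this length
		if ((c & 0xE0) == 0xC0) { n = 2; cp = c & 0x1F; min = 0x80; }
		else if ((c & 0xF0) == 0xE0) { n = 3; cp = c & 0x0F; min = 0x800; }
		else if ((c & 0xF8) == 0xF0) { n = 4; cp = c & 0x07; min = 0x10000; }
		else { return false; } // continuation byte or invalid lead

		if (length - i < n) { return false; } // truncated sequence
		for (size_t k = 1; k < n; ++k)
		{
			if ((s[i + k] & 0xC0) != 0x80) { return false; }
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) { return false; }
		i += n;
	}
	return true;
}

enum class SketchEncoding { Utf8, Legacy };

// Decide which conversion path a byte string would take.
SketchEncoding Sketch_Pick_Encoding(const std::string& src)
{
	return Sketch_Utf8_Validate(src.data(), src.size()) ? SketchEncoding::Utf8 : SketchEncoding::Legacy;
}

// Legacy path as described in the comment: widen each byte 1:1.
std::u16string Sketch_Legacy_Widen(const std::string& src)
{
	std::u16string out;
	out.reserve(src.size());
	for (const unsigned char c : src)
	{
		out.push_back(static_cast<char16_t>(c));
	}
	return out;
}
```

Two limits of this heuristic are worth keeping in mind. A 1:1 byte-to-wide cast matches Latin-1 and only approximates CP1252, which assigns different characters to 0x80 through 0x9F. Also, some CP1252 byte strings happen to be valid UTF-8 (for example the bytes 0xC3 0xA9), so the check cannot distinguish the two encodings in every case.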

xezon · 2026-04-23T20:54:00Z

Maybe put a breakpoint and check usage patterns.

xezon · 2026-05-30T10:46:43Z

What is the current state of this? Can it be finalized?

bobtista · 2026-06-29T19:06:18Z

> What is the current state of this? Can it be finalized?

Yeah, I just pushed cleanups addressing the open threads, it should be ready for another pass

greptile-apps Bot reviewed Apr 3, 2026

Comment thread Core/Libraries/Source/WWVegas/WWLib/utf8.cpp Outdated

xezon reviewed Apr 3, 2026

xezon added the "Enhancement" (Is new feature or request) and "Minor" (Severity: Minor < Major < Critical < Blocker) labels Apr 3, 2026

greptile-apps Bot reviewed Apr 3, 2026

Comment thread Core/GameEngine/Source/GameNetwork/GameInfo.cpp Outdated

xezon reviewed Apr 4, 2026

xezon reviewed Apr 5, 2026

xezon reviewed Apr 6, 2026

feat(utf8): add UTF-8 string conversion and validation functions

40393b8

bobtista force-pushed the bobtista/feat/utf8-string-functions branch from 39d7229 to 40393b8 Compare April 6, 2026 20:02

Mauller reviewed Apr 7, 2026

bobtista added 2 commits April 7, 2026 14:37

refactor(utf8): Return size_t from conversions, use consistent len naming, null-terminate when room

abb71f0

refactor(utf8): Update callers to use new conversion API

0c9074d

greptile-apps Bot reviewed Apr 7, 2026

Comment thread Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt

refactor(utf8): rename to Utf16Le_To_Utf8 and return required size on truncation

149a07f

xezon reviewed Apr 17, 2026

refactor(utf8): add writeDirect mode, use _Len helpers, const locals, <= 0 checks

6097799

greptile-apps Bot reviewed Apr 17, 2026

Comment thread Core/Libraries/Source/WWVegas/WWLib/utf8.cpp Outdated

xezon reviewed Apr 18, 2026

Comment thread Core/Libraries/Source/WWVegas/WWLib/utf8.cpp Outdated

refactor(utf8): simplify conversion API and reject UTF-16 surrogates

9078de5

greptile-apps Bot reviewed Apr 18, 2026

Comment thread Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp Outdated

xezon reviewed Apr 18, 2026

bobtista added 2 commits April 18, 2026 14:32

style(utf8): assert after write, add braces, const locals

327bb4b

style(utf8): Use >= 0 in length return ternaries

4d5d2dc

githubawn mentioned this pull request Apr 21, 2026

feat(input): Implement SDL3 input and window management #2639

Draft

xezon mentioned this pull request Jun 15, 2026

refactor(string): Add functions for handling UTF8 encoded strings #2045

Closed

bobtista added 3 commits June 29, 2026 14:59

refactor(utf8): remove unused validators and simplify conversion failure handling

f582ac8

style(string): add braces to translate conversion check

40683de

fix(string): return empty string when ThreadUtils conversion fails

44f8fca
