fix(extraction): recover heavily-reflected Unreal Engine C++ classes (in-body reflection macros)#1158
Open
luoyxy wants to merge 2 commits into
…ted C++ classes survive
Unreal-Engine reflection markup — `UPROPERTY(...)`, `UFUNCTION(...)`,
`GENERATED_BODY()`, `UE_DEPRECATED_*(...)`, `DECLARE_DELEGATE_*(...)` — are
no-semicolon macro CALLS decorating members. tree-sitter's C++ grammar
doesn't know they are macros, so each drops into error recovery; in a
heavily-reflected class the errors accumulate until the enclosing
`class_specifier` can't close and the whole class — its base clause and
members — collapses into an ERROR node and disappears from the graph.
`CharacterMovementComponent.h` (UCharacterMovementComponent, ~240 such
macros) was dropped entirely, breaking subclass / type-hierarchy /
inheritance-impact queries for it.
Add `blankCppAnnotationMacroCalls` to the C++ preParse chain (after
`blankCppExportMacros` and `blankCppInlineMacros`). It blanks a
line-leading, ALL-CAPS, no-semicolon macro call with equal-length spaces
(offset-preserving, so line/column stay exact) when the first char after
its balanced `(...)` starts a declaration (`[A-Za-z_~#]`) — i.e. the macro
decorates the thing that follows. The rule is name-list-FREE (keys on
structure, not a curated list), so it covers UE's hundreds of markup
macros and project-specific ones alike.
Matched tightly so it never touches legitimate C++: an expression/
condition use isn't line-leading (`if (CHECK(x))`), a statement call ends
in `;` (`FOO(x);`), an init-list item ends in `,`/`{` (`: MEMBER_A(1),`),
and an expression fragment is followed by an operator (`MAKE(a) + 1`) —
all rejected. String/char literals inside the args are skipped so an
embedded `)` can't mis-close the balance.
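
The rule described above can be sketched roughly as follows. This is a hypothetical illustration, not the code in the PR: the names `matchingParen` and `blankLineLeadingMacroCalls` are made up here, and skipping whitespace before checking the "declaration start" character is an assumption about how the real pass behaves.

```typescript
// Hypothetical sketch of a line-leading annotation-macro blanking pass.
// Function names and the whitespace handling are assumptions, not the PR's code.

// Returns the index of the ')' that balances the '(' at `open`, or -1.
function matchingParen(src: string, open: number): number {
  let depth = 0;
  for (let i = open; i < src.length; i++) {
    const c = src.charAt(i);
    if (c === '"' || c === "'") {
      // Skip a string/char literal so an embedded ')' cannot mis-close the balance.
      i++;
      while (i < src.length && src.charAt(i) !== c) {
        if (src.charAt(i) === "\\") i++; // skip the escaped character
        i++;
      }
      continue; // the loop increment steps past the closing quote
    }
    if (c === "(") depth++;
    else if (c === ")" && --depth === 0) return i;
  }
  return -1;
}

function blankLineLeadingMacroCalls(src: string): string {
  const out = src.split("");
  // Line-leading ALL-CAPS identifier followed by '('.
  const lineLeadingMacro = /^[ \t]*[A-Z][A-Z0-9_]*[ \t]*\(/gm;
  let m: RegExpExecArray | null;
  while ((m = lineLeadingMacro.exec(src)) !== null) {
    const open = lineLeadingMacro.lastIndex - 1;
    const close = matchingParen(src, open);
    if (close < 0) continue;
    // First non-whitespace character after the balanced call.
    let next = close + 1;
    while (next < src.length && /\s/.test(src.charAt(next))) next++;
    // Only blank when a declaration follows; ';', ',', '{' and operators are rejected.
    if (!/[A-Za-z_~#]/.test(src.charAt(next))) continue;
    for (let k = m.index; k <= close; k++) {
      // Keep newlines so line and column numbers stay exact.
      if (out[k] !== "\n" && out[k] !== "\r") out[k] = " ";
    }
    lineLeadingMacro.lastIndex = close + 1;
  }
  return out.join("");
}
```

Because every blanked character is replaced by exactly one space, the output has the same length as the input, and any node tree-sitter reports afterwards maps back to the original line and column.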
Verified on the real UCharacterMovementComponent.h (class recovered) with
regression tests covering the recovery and the four non-markup shapes.
Co-authored-by: Cursor <cursoragent@cursor.com>
…reflection annotations

Follow-up to the in-body reflection-macro fix, closing the remaining gaps that still dropped large Unreal-Engine classes. Two offset-preserving pre-parse passes, both C++-only and tightly guarded:

- `blankCppApiPrefixMacros`: the `*_API` / `*_EXPORT` / `*_ABI` visibility macro also prefixes nearly every exported member of a big UE class (`ENGINE_API virtual void Tick(...)`, `static ENGINE_API void Foo(...)`). tree-sitter reads the macro as an extra type token, so each declaration falls into error recovery and its return type becomes an orphan ERROR; on Actor.h / World.h hundreds accumulate and can still tip the class into collapse. Blanked by ALL-CAPS token ending in the conventional suffix and immediately followed by a declaration token, so a value use (`x = FOO_API;`, `== FOO_API)`) never matches.
- `blankCppInlineAnnotationMacros`: `UMETA` / `UPARAM` / `UE_DEPRECATED*` can sit mid-line where the line-leading recovery can't reach: an enum value's `UMETA(...)`, a parameter's `UPARAM(ref)`, or a deprecation tag inside a `using` alias (`using X UE_DEPRECATED(5.5,"...") = ...;`, which alone collapsed UWorld in World.h). Matched by a UE-only name list (zero risk to non-UE code) and blanked with balanced-paren scanning (string literals skipped).

Verified on the real engine headers: the main class of Actor, ActorComponent, SkeletalMeshComponent, World, LightComponent, and CharacterMovementComponent is now recovered, with residual tree-sitter errors cut from the hundreds to single/low-double digits. Adds regression tests (recovery + offset-preserving blank + non-declaration guard cases); full extraction suite green (the only failures are the pre-existing node:sqlite FTS5 / Windows EBUSY environment issues, unrelated to parsing).

Co-authored-by: Cursor <cursoragent@cursor.com>
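
As an illustration of the member-level visibility-macro pass, a minimal sketch might look like the following. The function name `blankApiPrefixMacros` and the exact regular expression are assumptions made for this example; the PR's real guard for "immediately followed by a declaration token" may differ.

```typescript
// Hypothetical sketch: blank ALL-CAPS *_API / *_EXPORT / *_ABI tokens that
// prefix a declaration. The regex is an assumption, not the PR's actual code.
function blankApiPrefixMacros(src: string): string {
  // The lookahead requires whitespace plus an identifier-like start, so value
  // uses such as `x = FOO_API;` or `== FOO_API)` are left untouched.
  const apiPrefix = /\b[A-Z][A-Z0-9_]*_(?:API|EXPORT|ABI)\b(?=[ \t]+[A-Za-z_~])/g;
  // Same-length replacement keeps every later offset unchanged.
  return src.replace(apiPrefix, (token) => " ".repeat(token.length));
}
```

For example, `ENGINE_API virtual void Tick(float Dt);` would become ten spaces followed by ` virtual void Tick(float Dt);`, leaving `virtual` at the same column as before.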
0a04f22 to
3c4c8e8
Summary
Follow-up to #1093, and a companion to #1133 (which fixes the `.h` language misdetection for macro-annotated class headers). This PR recovers heavily reflected Unreal Engine C++ classes that tree-sitter drops today because of the reflection macros sprinkled through the class body, not just on the header.
Three offset-preserving pre-parse blanking passes, all gated so standard C++ and
other libraries are untouched:
- In-body annotation macros: `UPROPERTY(...)`, `UFUNCTION(...)`, `UCLASS(...)`, `GENERATED_BODY()`, `UE_DEPRECATED_*(...)`, `DECLARE_DELEGATE_*(...)` are no-semicolon macro calls tree-sitter doesn't recognize, so each drops into error recovery. In a big class the errors pile up until the whole `class_specifier` collapses and the class, its base clause and its members vanish. `UCharacterMovementComponent` (~240 such macros) disappeared entirely, breaking every subclass / type-hierarchy / blast-radius query that went through it. Line-leading annotation macros are now blanked before parsing so the class survives.
- Member/method-level export macros: the `*_API` macro doesn't only sit on the class header; it prefixes almost every exported member of a large UE class (`ENGINE_API virtual void Tick(...)`, `static ENGINE_API void AddReferencedObjects(...)`). The parser read the macro as an extra type token and each such declaration fell into error recovery; on headers like `Actor.h` and `World.h` hundreds of return types piled up as orphan errors and could still tip the class into collapse. Member/method-level `*_API` / `*_EXPORT` / `*_ABI` macros (Unreal, Qt/Boost, LLVM) are now blanked before parsing, mirroring the existing class-header recovery.
- Mid-line annotation macros: an enum value's `UMETA(DisplayName=...)`, a parameter's `UPARAM(ref)`, or a deprecation tag wedged into a `using` alias (`using FOnNetTick UE_DEPRECATED(5.5, "...") = ...;`, which alone collapsed `UWorld` in `World.h`). These sit in positions the line-leading recovery structurally can't reach, and a single one could take down the surrounding enum or class. They are matched by an Unreal-only name list (`UMETA`, `UPARAM`, `UE_DEPRECATED*`) so no standard-C++ or other-library code is affected.
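
To make the mid-line case concrete, here is a rough sketch of a name-list-gated blanking pass. It is illustrative only: `closeOf` and `blankInlineAnnotationMacros` are hypothetical names, and the way `UE_DEPRECATED*` is expanded into a pattern is an assumption rather than the PR's implementation.

```typescript
// Hypothetical sketch of a mid-line annotation-macro pass gated by a UE-only
// name list. Names and the pattern for UE_DEPRECATED* are assumptions.

// Returns the index of the ')' balancing the '(' at `open`, or -1.
function closeOf(src: string, open: number): number {
  let depth = 0;
  for (let i = open; i < src.length; i++) {
    const c = src.charAt(i);
    if (c === '"' || c === "'") {
      // Skip literals so a ')' inside a deprecation message is ignored.
      i++;
      while (i < src.length && src.charAt(i) !== c) {
        if (src.charAt(i) === "\\") i++;
        i++;
      }
      continue;
    }
    if (c === "(") depth++;
    else if (c === ")" && --depth === 0) return i;
  }
  return -1;
}

function blankInlineAnnotationMacros(src: string): string {
  const out = src.split("");
  // Only these Unreal names are considered, wherever they sit on the line.
  const inlineMacro = /\b(?:UMETA|UPARAM|UE_DEPRECATED[A-Z0-9_]*)[ \t]*\(/g;
  let m: RegExpExecArray | null;
  while ((m = inlineMacro.exec(src)) !== null) {
    const close = closeOf(src, inlineMacro.lastIndex - 1);
    if (close < 0) continue;
    for (let k = m.index; k <= close; k++) {
      if (out[k] !== "\n" && out[k] !== "\r") out[k] = " "; // preserve line structure
    }
    inlineMacro.lastIndex = close + 1;
  }
  return out.join("");
}
```

Unlike the line-leading pass, this one needs no "what follows" check, since the name list itself is what keeps it away from non-Unreal code.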
Together these three fixes recover the main class of every large Unreal Engine header tested: `Actor`, `ActorComponent`, `SkeletalMeshComponent`, `World`, `LightComponent`, `CharacterMovementComponent`.

Changes
- `src/extraction/languages/c-cpp.ts`: three new blanking passes chained into `preParseCppSource`; offset-preserving, Unreal / allow-list gated.
- `__tests__/extraction.test.ts`: regression tests for each pass (class recovery, blanking correctness, and guard/non-regression on plain C++).
- `CHANGELOG.md`: three entries under `[Unreleased] › Fixes`.

Test plan
- `npm test`: extraction suite green; new cases assert class recovery, `extends` edges (incl. multi-interface bases), and inline method defs.
- `UCharacterMovementComponent`, `UAbilitySystemComponent`, `AActor`, `UWorld`, `UGameplayAbility` resolve to their definition bodies (not `[]` / forward-decl-only), and multi-interface `extends` edges are complete.

Fixes #1160.