[AURON #1891] Implement randn() function#1938
Conversation
There was a problem hiding this comment.
Pull request overview
This PR implements the randn() function to improve Spark function coverage in Auron. The function generates random values from a standard normal distribution with optional seed support.
Changes:
- Added Rust implementation of
spark_randnfunction with seed handling - Registered the new function in the Scala converter and Rust function registry
- Added
rand_distrdependency for normal distribution sampling
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| spark-extension/src/main/scala/org/apache/spark/sql/auron/NativeConverters.scala | Added case handler for Randn expression to route to native implementation |
| native-engine/datafusion-ext-functions/src/spark_randn.rs | New implementation of randn function with seed handling and unit tests |
| native-engine/datafusion-ext-functions/src/lib.rs | Registered Spark_Randn function in the extension function factory |
| native-engine/datafusion-ext-functions/Cargo.toml | Added rand and rand_distr dependencies |
| Cargo.toml | Added rand_distr workspace dependency |
| Cargo.lock | Updated lock file with rand_distr package metadata |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Resolve conflicts between randn and spark_partition_id features: - Proto: spark_partition_id_expr at 20101, randn_expr at 20102 - Planner: include both expression handlers Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
@robreeves Nice work! LGTM. |
Resolved conflicts by assigning separate IDs to randn and monotonically_increasing_id: - MonotonicIncreasingIdExprNode: ID 20102 - RandnExprNode: ID 20103 Both expressions are now supported in the proto definitions and planner. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add test to AuronFunctionSuite to verify randn functionality with seeds. The test validates that Auron's native randn implementation produces the same reproducible results as Spark's baseline when using explicit seeds. Test covers: - randn with seed 42 - randn with seed 100 - Validates against Spark baseline using checkSparkAnswerAndOperator Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
I added a |
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
@richox can you run the PR checks again? |
|
@cxzl25 can you run the PR checks? Thanks |
ShreyeshArangath
left a comment
There was a problem hiding this comment.
Changes LGTM, just one comment about the naming of this rust function.
Auron's ConfigOption alt keys (declared via addAltKey) were silently ignored: getFromSpark only consulted alt keys via ConfigEntry.findEntry (always null for Auron's unregistered options) and then synthesized a ConfigEntryWithDefaultFunction with an empty alternatives list, so only the primary key was ever read from SQLConf. As a result, e.g. setting spark.auron.enable (alt of spark.auron.enabled) had no effect. Pass the spark-prefixed alt keys as the synthesized entry's alternatives so ConfigEntry#readString reads primary +: alternatives, with the primary key taking precedence. Also add a test asserting alt keys are honored. Fixing this makes the test harness's spark.auron.enable=false baseline actually fall back to vanilla Spark, which exposed that acosh(0.0) yields NaN with a different (implementation-defined) bit pattern in each engine; QueryTest compares doubles via Double.doubleToRawLongBits, so update the acosh test to assert NaN-ness for the out-of-domain input rather than exact equality. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
QueryTest compares doubles via Double.doubleToRawLongBits, which is bit-exact. Vanilla Spark and the native engine can produce semantically equal NaNs with different (implementation-defined) bit patterns, so the comparison would spuriously fail. Canonicalize NaN on both sides before comparing. This lets the acosh null propagation test keep its original single-query form covering the out-of-domain (NaN) input. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cker Revert checkSparkAnswerAndOperator to plain checkAnswer and instead handle the NaN encoding difference locally in the acosh test. acosh of an out-of-domain input yields NaN, which vanilla Spark and the native engine may encode with different bits; checkAnswer/QueryTest compares doubles by raw bits. Split the test so in-domain/null values are compared numerically, and out-of-domain inputs are compared via the natively-supported isnan (a boolean) so no raw NaN bits are compared. This keeps the shared checker unchanged and avoids relaxing NaN comparison for all callers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
With the config alt-key fix in place, the test harness's spark.auron.enable=false baseline now actually runs vanilla Spark, so checkSparkAnswerAndOperator compares Auron's randn against Spark's randn. These differ by design: the native engine uses StdRng/StandardNormal while Spark uses XORShiftRandom + nextGaussian, and randn is non-deterministic and not intended to be bit-compatible with Spark. Rewrite the randn tests to verify the expression is executed natively, produces a non-null value per row, and is reproducible for a fixed seed (and that different seeds produce different values), instead of asserting exact equality with vanilla Spark. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts: # spark-extension-shims-spark/src/test/scala/org/apache/auron/AuronFunctionSuite.scala
The randn tests re-implemented the "assert the executed plan is fully native" logic that checkSparkAnswerAndOperator already contains. Extract it into a protected assertNativeOperator method on AuronQueryTest, have checkSparkAnswerAndOperator call it, and reuse it from the randn tests instead of a duplicated local helper. Behavior is unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
test_randn_generates_different_values_per_row asserted that all generated normal samples were distinct. Random samples are allowed to repeat, so that property isn't required for correctness. Assert only that the output is not constant across rows, which is what actually verifies per-row generation (vs. a single value broadcast). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
cargo fmt --check flagged the doc comment and test comment line widths in spark_randn.rs, failing the Style and Rust Lint CI checks. Reformat them with rustfmt. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| /** Fail if any operator in the executed plan is not native or a pass-through. */ | ||
| protected def assertNativeOperator(df: DataFrame): Unit = { | ||
| val plan = stripAQEPlan(df.queryExecution.executedPlan) |
There was a problem hiding this comment.
Good point. Renamed to assertPlanIsNative to reflect that it scans the whole executed plan and fails on any non-native operator. Fixed in 7e77e1e.
| test("randn function with seed") { | ||
| withTable("t1") { | ||
| sql("CREATE TABLE t1(id INT) USING parquet") | ||
| sql("INSERT INTO t1 VALUES(1), (2), (3)") | ||
|
|
||
| // randn is non-deterministic and intentionally does not replicate Spark's RNG, so its | ||
| // values cannot be compared against vanilla Spark. Verify it runs natively, produces a | ||
| // non-null value per row, and is reproducible for a fixed seed. | ||
| val query = "SELECT id, randn(42) AS r1, randn(100) AS r2 FROM t1 ORDER BY id" | ||
| val df = sql(query) | ||
| val rows = df.collect() |
There was a problem hiding this comment.
Added a randn function without seed test that runs randn() with no seed and asserts native execution plus a non-null value per row (values aren't reproducible across executions since the seed is randomly assigned). Fixed in 7e77e1e.
- Rename assertNativeOperator to assertPlanIsNative, since it scans the whole executed plan and fails on any non-native operator rather than asserting on a single operator. - Add a "randn function without seed" test that runs randn() with no seed, asserting native execution and a non-null value per row (values are not reproducible across executions since the seed is randomly assigned). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Which issue does this PR close?
Closes #1891
Rationale for this change
This improves function coverage in Auron by creating a native randn implementation.
What changes are included in this PR?
Adds a native randn implementation.
Are there any user-facing changes?
Yes, it adds the randn function.
How was this patch tested?
Added unit tests and manually tested in spark-shell.
Output: