[AURON #1891] Implement randn() function #1938

Open
robreeves wants to merge 37 commits into apache:master from robreeves:randn

robreeves commented Jan 21, 2026

Contributor

Which issue does this PR close?

Closes #1891

Rationale for this change

This improves function coverage in Auron by creating a native randn implementation.

What changes are included in this PR?

Adds a native randn implementation.

Are there any user-facing changes?

Yes, it adds the randn function.

How was this patch tested?

Added unit tests and manually tested in spark-shell.

import org.apache.spark.sql.functions.randn

val df = spark.range(5)
val outputPath = "/tmp/spark_range_output.parquet"
df.write.mode("overwrite").parquet(outputPath)

val readDf = spark.read.parquet(outputPath)
val resultDf = readDf.withColumn("random_normal", randn(18))
resultDf.collect

Output:

26/01/30 15:41:22 WARN NativeHelper: memory total: 1408.0 MiB, onheap: 1024.0 MiB, offheap: 384.0 MiB
26/01/30 15:41:24 WARN AuronCallNativeWrapper: Start executing native plan
26/01/30 15:41:24 WARN AuronCallNativeWrapper: Start executing native plan
26/01/30 15:41:24 WARN AuronCallNativeWrapper: Start executing native plan
26/01/30 15:41:24 WARN AuronCallNativeWrapper: Start executing native plan
26/01/30 15:41:24 WARN AuronCallNativeWrapper: Start executing native plan
26/01/30 15:41:24 WARN AuronCallNativeWrapper: Start executing native plan
------ initializing auron native environment ------
initializing logging with level: info
2026-01-30 15:41:24.368 (+0.000s) [INFO] [auron::exec:73] (stage: 0, partition: 0, tid: 0) - initializing JNI bridge
2026-01-30 15:41:24.369 (+0.001s) [INFO] [auron_jni_bridge::jni_bridge:473] (stage: 0, partition: 0, tid: 0) - Initializing JavaClasses...
2026-01-30 15:41:24.375 (+0.007s) [INFO] [auron_jni_bridge::jni_bridge:529] (stage: 0, partition: 0, tid: 0) - Initializing JavaClasses finished
2026-01-30 15:41:24.375 (+0.007s) [INFO] [auron::exec:77] (stage: 0, partition: 0, tid: 0) - initializing datafusion session
2026-01-30 15:41:24.375 (+0.007s) [INFO] [auron_memmgr:48] (stage: 0, partition: 0, tid: 0) - mem manager initialized with total memory: 230.4 MiB
2026-01-30 15:41:24.385 (+0.017s) [INFO] [auron::rt:146] (stage: 2, partition: 1, tid: 12) - start executing plan:
ProjectExec [#3@0 AS #3, Randn(seed=18, partition=1) AS #5], schema=[#3:Int64;N, #5:Float64]
  RenameColumnsExec: ["#3"], schema=[#3:Int64;N]
    ParquetExec: limit=None, file_group=[FileGroup { files: [], statistics: None }, FileGroup { files: [PartitionedFile { object_meta: ObjectMeta { location: Path { raw: "ZmlsZTovLy90bXAvc3BhcmtfcmFuZ2Vfb3V0cHV0LnBhcnF1ZXQvcGFydC0wMDAwMS04ZTkwNmRiYS0zZDg3LTRkZWMtYjM0NC1hYjdiZWUyODEwZWQtYzAwMC5zbmFwcHkucGFycXVldA" }, last_modified: 1970-01-01T00:00:00Z, size: 472, e_tag: None, version: None }, partition_values: [], range: Some(FileRange { start: 0, end: 472 }), statistics: None, extensions: None, metadata_size_hint: None }], statistics: Some(Statistics { num_rows: Exact(0), total_byte_size: Exact(0), column_statistics: [ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, sum_value: Absent, distinct_count: Absent }] }) }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }], predicate=Some(Literal { value: Boolean(true), field: Field { name: "lit", data_type: Boolean, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }), schema=[id:Int64;N]

2026-01-30 15:41:24.385 (+0.017s) [INFO] [auron::rt:146] (stage: 2, partition: 5, tid: 16) - start executing plan:
ProjectExec [#3@0 AS #3, Randn(seed=18, partition=5) AS #5], schema=[#3:Int64;N, #5:Float64]
  RenameColumnsExec: ["#3"], schema=[#3:Int64;N]
    ParquetExec: limit=None, file_group=[FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [PartitionedFile { object_meta: ObjectMeta { location: Path { raw: "ZmlsZTovLy90bXAvc3BhcmtfcmFuZ2Vfb3V0cHV0LnBhcnF1ZXQvcGFydC0wMDAwMC04ZTkwNmRiYS0zZDg3LTRkZWMtYjM0NC1hYjdiZWUyODEwZWQtYzAwMC5zbmFwcHkucGFycXVldA" }, last_modified: 1970-01-01T00:00:00Z, size: 297, e_tag: None, version: None }, partition_values: [], range: Some(FileRange { start: 0, end: 297 }), statistics: None, extensions: None, metadata_size_hint: None }], statistics: Some(Statistics { num_rows: Exact(0), total_byte_size: Exact(0), column_statistics: [ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, sum_value: Absent, distinct_count: Absent }] }) }], predicate=Some(Literal { value: Boolean(true), field: Field { name: "lit", data_type: Boolean, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }), schema=[id:Int64;N]

2026-01-30 15:41:24.385 (+0.017s) [INFO] [auron::rt:146] (stage: 2, partition: 2, tid: 13) - start executing plan:
ProjectExec [#3@0 AS #3, Randn(seed=18, partition=2) AS #5], schema=[#3:Int64;N, #5:Float64]
  RenameColumnsExec: ["#3"], schema=[#3:Int64;N]
    ParquetExec: limit=None, file_group=[FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [PartitionedFile { object_meta: ObjectMeta { location: Path { raw: "ZmlsZTovLy90bXAvc3BhcmtfcmFuZ2Vfb3V0cHV0LnBhcnF1ZXQvcGFydC0wMDAwMy04ZTkwNmRiYS0zZDg3LTRkZWMtYjM0NC1hYjdiZWUyODEwZWQtYzAwMC5zbmFwcHkucGFycXVldA" }, last_modified: 1970-01-01T00:00:00Z, size: 472, e_tag: None, version: None }, partition_values: [], range: Some(FileRange { start: 0, end: 472 }), statistics: None, extensions: None, metadata_size_hint: None }], statistics: Some(Statistics { num_rows: Exact(0), total_byte_size: Exact(0), column_statistics: [ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, sum_value: Absent, distinct_count: Absent }] }) }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }], predicate=Some(Literal { value: Boolean(true), field: Field { name: "lit", data_type: Boolean, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }), schema=[id:Int64;N]

2026-01-30 15:41:24.385 (+0.017s) [INFO] [auron::rt:146] (stage: 2, partition: 4, tid: 15) - start executing plan:
ProjectExec [#3@0 AS #3, Randn(seed=18, partition=4) AS #5], schema=[#3:Int64;N, #5:Float64]
  RenameColumnsExec: ["#3"], schema=[#3:Int64;N]
    ParquetExec: limit=None, file_group=[FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [PartitionedFile { object_meta: ObjectMeta { location: Path { raw: "ZmlsZTovLy90bXAvc3BhcmtfcmFuZ2Vfb3V0cHV0LnBhcnF1ZXQvcGFydC0wMDAwNS04ZTkwNmRiYS0zZDg3LTRkZWMtYjM0NC1hYjdiZWUyODEwZWQtYzAwMC5zbmFwcHkucGFycXVldA" }, last_modified: 1970-01-01T00:00:00Z, size: 471, e_tag: None, version: None }, partition_values: [], range: Some(FileRange { start: 0, end: 471 }), statistics: None, extensions: None, metadata_size_hint: None }], statistics: Some(Statistics { num_rows: Exact(0), total_byte_size: Exact(0), column_statistics: [ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, sum_value: Absent, distinct_count: Absent }] }) }, FileGroup { files: [], statistics: None }], predicate=Some(Literal { value: Boolean(true), field: Field { name: "lit", data_type: Boolean, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }), schema=[id:Int64;N]

2026-01-30 15:41:24.385 (+0.017s) [INFO] [auron::rt:146] (stage: 2, partition: 3, tid: 14) - start executing plan:
ProjectExec [#3@0 AS #3, Randn(seed=18, partition=3) AS #5], schema=[#3:Int64;N, #5:Float64]
  RenameColumnsExec: ["#3"], schema=[#3:Int64;N]
    ParquetExec: limit=None, file_group=[FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [PartitionedFile { object_meta: ObjectMeta { location: Path { raw: "ZmlsZTovLy90bXAvc3BhcmtfcmFuZ2Vfb3V0cHV0LnBhcnF1ZXQvcGFydC0wMDAwOS04ZTkwNmRiYS0zZDg3LTRkZWMtYjM0NC1hYjdiZWUyODEwZWQtYzAwMC5zbmFwcHkucGFycXVldA" }, last_modified: 1970-01-01T00:00:00Z, size: 472, e_tag: None, version: None }, partition_values: [], range: Some(FileRange { start: 0, end: 472 }), statistics: None, extensions: None, metadata_size_hint: None }], statistics: Some(Statistics { num_rows: Exact(0), total_byte_size: Exact(0), column_statistics: [ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, sum_value: Absent, distinct_count: Absent }] }) }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }], predicate=Some(Literal { value: Boolean(true), field: Field { name: "lit", data_type: Boolean, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }), schema=[id:Int64;N]

2026-01-30 15:41:24.385 (+0.017s) [INFO] [auron::rt:146] (stage: 2, partition: 0, tid: 11) - start executing plan:
ProjectExec [#3@0 AS #3, Randn(seed=18, partition=0) AS #5], schema=[#3:Int64;N, #5:Float64]
  RenameColumnsExec: ["#3"], schema=[#3:Int64;N]
    ParquetExec: limit=None, file_group=[FileGroup { files: [PartitionedFile { object_meta: ObjectMeta { location: Path { raw: "ZmlsZTovLy90bXAvc3BhcmtfcmFuZ2Vfb3V0cHV0LnBhcnF1ZXQvcGFydC0wMDAwNy04ZTkwNmRiYS0zZDg3LTRkZWMtYjM0NC1hYjdiZWUyODEwZWQtYzAwMC5zbmFwcHkucGFycXVldA" }, last_modified: 1970-01-01T00:00:00Z, size: 472, e_tag: None, version: None }, partition_values: [], range: Some(FileRange { start: 0, end: 472 }), statistics: None, extensions: None, metadata_size_hint: None }], statistics: Some(Statistics { num_rows: Exact(0), total_byte_size: Exact(0), column_statistics: [ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, sum_value: Absent, distinct_count: Absent }] }) }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }, FileGroup { files: [], statistics: None }], predicate=Some(Literal { value: Boolean(true), field: Field { name: "lit", data_type: Boolean, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }), schema=[id:Int64;N]

2026-01-30 15:41:24.394 (+0.026s) [INFO] [datafusion_datasource_parquet::opener:421] (stage: 2, partition: 4, tid: 15) - executing parquet scan with adaptive batch size: 10000
2026-01-30 15:41:24.394 (+0.026s) [INFO] [datafusion_datasource_parquet::opener:421] (stage: 2, partition: 0, tid: 11) - executing parquet scan with adaptive batch size: 10000
2026-01-30 15:41:24.394 (+0.026s) [INFO] [datafusion_datasource_parquet::opener:421] (stage: 2, partition: 3, tid: 14) - executing parquet scan with adaptive batch size: 10000
2026-01-30 15:41:24.394 (+0.026s) [INFO] [datafusion_datasource_parquet::opener:421] (stage: 2, partition: 1, tid: 12) - executing parquet scan with adaptive batch size: 10000
2026-01-30 15:41:24.394 (+0.026s) [INFO] [datafusion_datasource_parquet::opener:421] (stage: 2, partition: 2, tid: 13) - executing parquet scan with adaptive batch size: 10000
2026-01-30 15:41:24.394 (+0.026s) [INFO] [datafusion_datasource_parquet::opener:421] (stage: 2, partition: 5, tid: 16) - executing parquet scan with adaptive batch size: 1
2026-01-30 15:41:24.488 (+0.120s) [INFO] [auron::rt:183] (stage: 2, partition: 5, tid: 16) - task finished
2026-01-30 15:41:24.488 (+0.120s) [INFO] [auron::rt:183] (stage: 2, partition: 0, tid: 11) - task finished
2026-01-30 15:41:24.488 (+0.120s) [INFO] [auron::rt:183] (stage: 2, partition: 4, tid: 15) - task finished
2026-01-30 15:41:24.488 (+0.120s) [INFO] [auron::rt:183] (stage: 2, partition: 3, tid: 14) - task finished
2026-01-30 15:41:24.488 (+0.120s) [INFO] [auron::rt:266] (stage: 0, partition: 0, tid: 0) - (partition=5) native execution finalizing
2026-01-30 15:41:24.488 (+0.120s) [INFO] [auron::rt:183] (stage: 2, partition: 2, tid: 13) - task finished
2026-01-30 15:41:24.488 (+0.120s) [INFO] [auron::rt:183] (stage: 2, partition: 1, tid: 12) - task finished
2026-01-30 15:41:24.488 (+0.120s) [INFO] [auron::rt:274] (stage: 0, partition: 0, tid: 0) - (partition=5) native execution finalized
2026-01-30 15:41:24.511 (+0.143s) [INFO] [auron::rt:266] (stage: 0, partition: 0, tid: 0) - (partition=3) native execution finalizing
2026-01-30 15:41:24.511 (+0.143s) [INFO] [auron::rt:266] (stage: 0, partition: 0, tid: 0) - (partition=4) native execution finalizing
2026-01-30 15:41:24.511 (+0.143s) [INFO] [auron::rt:266] (stage: 0, partition: 0, tid: 0) - (partition=0) native execution finalizing
2026-01-30 15:41:24.511 (+0.143s) [INFO] [auron::rt:266] (stage: 0, partition: 0, tid: 0) - (partition=2) native execution finalizing
2026-01-30 15:41:24.511 (+0.143s) [INFO] [auron::rt:266] (stage: 0, partition: 0, tid: 0) - (partition=1) native execution finalizing
2026-01-30 15:41:24.512 (+0.144s) [INFO] [auron::rt:274] (stage: 0, partition: 0, tid: 0) - (partition=0) native execution finalized
2026-01-30 15:41:24.512 (+0.144s) [INFO] [auron::rt:274] (stage: 0, partition: 0, tid: 0) - (partition=4) native execution finalized
2026-01-30 15:41:24.512 (+0.144s) [INFO] [auron::rt:274] (stage: 0, partition: 0, tid: 0) - (partition=1) native execution finalized
2026-01-30 15:41:24.512 (+0.144s) [INFO] [auron::rt:274] (stage: 0, partition: 0, tid: 0) - (partition=2) native execution finalized
2026-01-30 15:41:24.512 (+0.144s) [INFO] [auron::rt:274] (stage: 0, partition: 0, tid: 0) - (partition=3) native execution finalized
import org.apache.spark.sql.functions.randn
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
outputPath: String = /tmp/spark_range_output.parquet
readDf: org.apache.spark.sql.DataFrame = [id: bigint]
resultDf: org.apache.spark.sql.DataFrame = [id: bigint, random_normal: double]
res0: Array[org.apache.spark.sql.Row] = Array([3,1.4607292672705405], [0,-0.3268302897860617], [1,-0.09087682847007866], [4,-1.2271197538792842], [2,-0.546398027932835])

Copilot AI left a comment

Contributor

Pull request overview

This PR implements the randn() function to improve Spark function coverage in Auron. The function generates random values from a standard normal distribution with optional seed support.

Changes:

  • Added Rust implementation of spark_randn function with seed handling
  • Registered the new function in the Scala converter and Rust function registry
  • Added rand_distr dependency for normal distribution sampling

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| spark-extension/src/main/scala/org/apache/spark/sql/auron/NativeConverters.scala | Added case handler for Randn expression to route to native implementation |
| native-engine/datafusion-ext-functions/src/spark_randn.rs | New implementation of randn function with seed handling and unit tests |
| native-engine/datafusion-ext-functions/src/lib.rs | Registered Spark_Randn function in the extension function factory |
| native-engine/datafusion-ext-functions/Cargo.toml | Added rand and rand_distr dependencies |
| Cargo.toml | Added rand_distr workspace dependency |
| Cargo.lock | Updated lock file with rand_distr package metadata |

Comment thread native-engine/datafusion-ext-functions/src/spark_randn.rs Outdated
Comment thread native-engine/datafusion-ext-functions/src/spark_randn.rs Outdated
Comment thread native-engine/datafusion-ext-functions/src/spark_randn.rs Outdated
robreeves and others added 8 commits January 21, 2026 21:40
Resolve conflicts between randn and spark_partition_id features:
- Proto: spark_partition_id_expr at 20101, randn_expr at 20102
- Planner: include both expression handlers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
robreeves marked this pull request as ready for review January 31, 2026 00:26
yew1eb commented Jan 31, 2026

Contributor

@robreeves Nice work! LGTM.
Could you add SQL unit tests (per AuronFunctionSuite) to align with Spark SQL's semantics? Thanks!

robreeves and others added 2 commits February 3, 2026 20:00
Resolved conflicts by assigning separate IDs to randn and monotonically_increasing_id:
- MonotonicIncreasingIdExprNode: ID 20102
- RandnExprNode: ID 20103

Both expressions are now supported in the proto definitions and planner.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add test to AuronFunctionSuite to verify randn functionality with seeds.
The test validates that Auron's native randn implementation produces
the same reproducible results as Spark's baseline when using explicit seeds.

Test covers:
- randn with seed 42
- randn with seed 100
- Validates against Spark baseline using checkSparkAnswerAndOperator

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@robreeves

Contributor Author

> @robreeves Nice work! LGTM. Could you add SQL unit tests (per AuronFunctionSuite) to align with Spark SQL's semantics? Thanks!

I added an AuronFunctionSuite test. Thanks.

robreeves and others added 2 commits February 8, 2026 07:36
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@robreeves

Contributor Author

@richox can you run the PR checks again?

@robreeves

Contributor Author

@cxzl25 can you run the PR checks? Thanks

ShreyeshArangath left a comment

Contributor

Changes LGTM, just one comment about the naming of this Rust function.

Comment thread native-engine/datafusion-ext-exprs/src/lib.rs Outdated

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.

Comment thread native-engine/datafusion-ext-exprs/src/spark_randn.rs
robreeves and others added 7 commits June 25, 2026 00:47
Auron's ConfigOption alt keys (declared via addAltKey) were silently
ignored: getFromSpark only consulted alt keys via ConfigEntry.findEntry
(always null for Auron's unregistered options) and then synthesized a
ConfigEntryWithDefaultFunction with an empty alternatives list, so only
the primary key was ever read from SQLConf. As a result, e.g. setting
spark.auron.enable (alt of spark.auron.enabled) had no effect.

Pass the spark-prefixed alt keys as the synthesized entry's alternatives
so ConfigEntry#readString reads primary +: alternatives, with the primary
key taking precedence.

Also add a test asserting alt keys are honored. Fixing this makes the
test harness's spark.auron.enable=false baseline actually fall back to
vanilla Spark, which exposed that acosh(0.0) yields NaN with a different
(implementation-defined) bit pattern in each engine; QueryTest compares
doubles via Double.doubleToRawLongBits, so update the acosh test to
assert NaN-ness for the out-of-domain input rather than exact equality.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
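
The lookup order described in the commit message above (primary key first, then each alternative key) can be sketched as follows. This is a Rust stand-in for the Scala logic, written only to illustrate the precedence rule; `read_with_alternatives` and the plain `HashMap` are hypothetical and are not Auron's or Spark's API.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not Auron's or Spark's API): return the value of the
/// first key that is set, trying the primary key before any alternative key.
fn read_with_alternatives<'a>(
    conf: &'a HashMap<String, String>,
    primary: &str,
    alternatives: &[&str],
) -> Option<&'a str> {
    // Chain the keys in precedence order: primary first, then alternatives.
    std::iter::once(primary)
        .chain(alternatives.iter().copied())
        .find_map(|key| conf.get(key).map(String::as_str))
}
```

With only `spark.auron.enable` set, a lookup for `spark.auron.enabled` with that alternative returns the alternative's value. Once both keys are set, the primary key's value is returned.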
QueryTest compares doubles via Double.doubleToRawLongBits, which is
bit-exact. Vanilla Spark and the native engine can produce semantically
equal NaNs with different (implementation-defined) bit patterns, so the
comparison would spuriously fail. Canonicalize NaN on both sides before
comparing. This lets the acosh null propagation test keep its original
single-query form covering the out-of-domain (NaN) input.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cker

Revert checkSparkAnswerAndOperator to plain checkAnswer and instead handle
the NaN encoding difference locally in the acosh test. acosh of an
out-of-domain input yields NaN, which vanilla Spark and the native engine
may encode with different bits; checkAnswer/QueryTest compares doubles by
raw bits. Split the test so in-domain/null values are compared numerically,
and out-of-domain inputs are compared via the natively-supported isnan
(a boolean) so no raw NaN bits are compared. This keeps the shared checker
unchanged and avoids relaxing NaN comparison for all callers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
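
The NaN issue discussed in the commits above can be illustrated with a small sketch. It uses Rust's `f64::to_bits` as an analogue of Java's `Double.doubleToRawLongBits`. The helper names and the two example bit patterns are chosen for illustration and are not the patterns either engine actually emits.

```rust
/// Two illustrative quiet NaNs that differ only in the sign bit (chosen for
/// this example; not the actual patterns produced by Spark or the native engine).
fn example_nans() -> (f64, f64) {
    (
        f64::from_bits(0x7ff8_0000_0000_0000),
        f64::from_bits(0xfff8_0000_0000_0000),
    )
}

/// Bit-exact equality, analogous to comparing raw long bits of two doubles.
fn raw_bits_equal(a: f64, b: f64) -> bool {
    a.to_bits() == b.to_bits()
}

/// Map every NaN to one canonical NaN so a bit-exact comparison treats all
/// NaNs as equal; non-NaN values pass through unchanged.
fn canonicalize_nan(x: f64) -> f64 {
    if x.is_nan() {
        f64::NAN
    } else {
        x
    }
}
```

Both example values are NaN, yet `raw_bits_equal` reports them as different. Canonicalizing both sides first makes them compare equal, and comparing `is_nan()` results (a boolean) sidesteps the raw bits altogether, which is the approach the last commit above takes for the acosh test.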
With the config alt-key fix in place, the test harness's
spark.auron.enable=false baseline now actually runs vanilla Spark, so
checkSparkAnswerAndOperator compares Auron's randn against Spark's randn.
These differ by design: the native engine uses StdRng/StandardNormal while
Spark uses XORShiftRandom + nextGaussian, and randn is non-deterministic
and not intended to be bit-compatible with Spark.

Rewrite the randn tests to verify the expression is executed natively,
produces a non-null value per row, and is reproducible for a fixed seed
(and that different seeds produce different values), instead of asserting
exact equality with vanilla Spark.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
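
The properties the rewritten tests check (a value per row, reproducibility for a fixed seed, different output for different seeds) can be sketched without external crates. This is not the PR's implementation: it swaps in SplitMix64 and the Box-Muller transform as std-only stand-ins for StdRng/StandardNormal, and seeding with `seed + partition` is an assumption of the sketch. `SplitMix64` and `randn_column` are hypothetical names.

```rust
/// SplitMix64: a small deterministic generator, used here only as a
/// std-only stand-in for the PR's StdRng.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform sample in (0, 1]; excluding 0 keeps ln() finite below.
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }
}

/// Hypothetical column generator: one standard-normal sample per row via the
/// Box-Muller transform. Seeding with `seed + partition` is an assumption.
fn randn_column(seed: i64, partition: i64, rows: usize) -> Vec<f64> {
    let mut rng = SplitMix64::new(seed.wrapping_add(partition) as u64);
    (0..rows)
        .map(|_| {
            let (u1, u2) = (rng.next_unit(), rng.next_unit());
            (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
        })
        .collect()
}
```

Calling `randn_column(42, 0, 3)` twice yields the same three values, while `randn_column(100, 0, 3)` yields a different column. Those are the same checks the tests make against the native expression, without any comparison to vanilla Spark's output.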
@robreeves

Contributor Author

This now includes the changes from #2361. #2361 should be merged first; then I will rebase this.

# Conflicts:
#	spark-extension-shims-spark/src/test/scala/org/apache/auron/AuronFunctionSuite.scala
Copilot AI review requested due to automatic review settings July 1, 2026 17:11

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.

Comment thread native-engine/datafusion-ext-exprs/src/spark_randn.rs Outdated
robreeves and others added 2 commits July 1, 2026 17:25
The randn tests re-implemented the "assert the executed plan is fully
native" logic that checkSparkAnswerAndOperator already contains. Extract
it into a protected assertNativeOperator method on AuronQueryTest, have
checkSparkAnswerAndOperator call it, and reuse it from the randn tests
instead of a duplicated local helper. Behavior is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
test_randn_generates_different_values_per_row asserted that all generated
normal samples were distinct. Random samples are allowed to repeat, so
that property isn't required for correctness. Assert only that the output
is not constant across rows, which is what actually verifies per-row
generation (vs. a single value broadcast).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
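
The difference between the two assertions in the commit message above can be made concrete with a short sketch. The helper names and sample values are hypothetical and are not taken from `spark_randn.rs`.

```rust
/// True if at least two values differ, i.e. the column is not a single value
/// broadcast to every row.
fn is_not_constant(values: &[f64]) -> bool {
    values.windows(2).any(|pair| pair[0] != pair[1])
}

/// The stricter property the earlier test asserted: no value appears twice.
fn all_distinct(values: &[f64]) -> bool {
    // Compare bit patterns so the values can be sorted and adjacent
    // duplicates detected.
    let mut bits: Vec<u64> = values.iter().map(|v| v.to_bits()).collect();
    bits.sort_unstable();
    bits.windows(2).all(|pair| pair[0] != pair[1])
}
```

A column such as `[0.3, -1.2, 0.3]` is valid random output with a repeat: it fails `all_distinct` but passes `is_not_constant`. A broadcast column such as `[0.7, 0.7, 0.7]` fails `is_not_constant`, which is the case the test is meant to catch.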
Copilot AI review requested due to automatic review settings July 1, 2026 17:47

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.

Comment thread native-engine/datafusion-ext-exprs/src/spark_randn.rs
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings July 1, 2026 17:54

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

Comment thread native-engine/datafusion-ext-exprs/src/spark_randn.rs
Comment thread native-engine/datafusion-ext-exprs/src/spark_randn.rs
cargo fmt --check flagged the doc comment and test comment line widths in
spark_randn.rs, failing the Style and Rust Lint CI checks. Reformat them
with rustfmt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

Comment on lines +72 to +74
/** Fail if any operator in the executed plan is not native or a pass-through. */
protected def assertNativeOperator(df: DataFrame): Unit = {
  val plan = stripAQEPlan(df.queryExecution.executedPlan)

Contributor Author

Good point. Renamed to assertPlanIsNative to reflect that it scans the whole executed plan and fails on any non-native operator. Fixed in 7e77e1e.

Comment on lines +1017 to +1027
test("randn function with seed") {
  withTable("t1") {
    sql("CREATE TABLE t1(id INT) USING parquet")
    sql("INSERT INTO t1 VALUES(1), (2), (3)")

    // randn is non-deterministic and intentionally does not replicate Spark's RNG, so its
    // values cannot be compared against vanilla Spark. Verify it runs natively, produces a
    // non-null value per row, and is reproducible for a fixed seed.
    val query = "SELECT id, randn(42) AS r1, randn(100) AS r2 FROM t1 ORDER BY id"
    val df = sql(query)
    val rows = df.collect()

Contributor Author

Added a "randn function without seed" test that runs randn() with no seed and asserts native execution plus a non-null value per row (values aren't reproducible across executions since the seed is randomly assigned). Fixed in 7e77e1e.

- Rename assertNativeOperator to assertPlanIsNative, since it scans the
  whole executed plan and fails on any non-native operator rather than
  asserting on a single operator.
- Add a "randn function without seed" test that runs randn() with no seed,
  asserting native execution and a non-null value per row (values are not
  reproducible across executions since the seed is randomly assigned).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Implement randn() function

6 participants