
Integrate datafusion-distributed with datafusion-python #1612

Description

@gabotechs

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Allow running distributed queries in datafusion-python

Describe the solution you'd like

Ideally something well integrated with datafusion-python that does not require big changes or using different APIs for executing distributed queries.

datafusion-python is already a very ergonomic wrapper for using datafusion, so something maintaining that philosophy without introducing a lot of API surface would be ideal.

I'm interested specifically in using the datafusion-distributed library from within Python, and I see three mutually exclusive ways of integrating it:

  • Make datafusion-python depend on datafusion-distributed, hiding some internal plumbing in datafusion-python and extending the current API with distributed capabilities.
  • Create an external crate that depends on both datafusion-distributed and datafusion-python, and that ships a separate API for using distributed functionality in datafusion-python.
  • Make datafusion-distributed depend on datafusion-python, providing a set of functions and classes that decorate datafusion-python with distributed capabilities
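To make the difference in API surface concrete, here is a minimal sketch of the two shapes these options could take from the user's side. Everything in it is hypothetical: `FakeSessionContext`, `DistributedContext`, the `workers` parameter, and the worker URL are stand-ins invented for illustration, not existing datafusion-python or datafusion-distributed APIs, and no real query is executed.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeSessionContext:
    """Stand-in for a datafusion-python session; NOT the real class."""

    # Hypothetical configuration value: URLs of the distributed workers.
    workers: list[str] = field(default_factory=list)

    def sql(self, query: str) -> str:
        # Option 1 shape: the same `sql` call, with distribution decided
        # by configuration rather than by a different API.
        mode = "distributed" if self.workers else "local"
        return f"[{mode}] {query}"


class DistributedContext:
    """Options 2 and 3 shape: a separate wrapper around an existing session."""

    def __init__(self, inner: FakeSessionContext, workers: list[str]):
        self.inner = inner
        self.workers = workers

    def sql(self, query: str) -> str:
        # The wrapper, not the session, decides how the query is run.
        return f"[distributed over {len(self.workers)} workers] {query}"


# Option 1: no new types, only one extra configuration value.
builtin = FakeSessionContext(workers=["http://worker-1:8080"])
plan_builtin = builtin.sql("SELECT 1")

# Options 2 and 3: users import and learn an additional wrapper type.
wrapped = DistributedContext(FakeSessionContext(), ["http://worker-1:8080"])
plan_wrapped = wrapped.sql("SELECT 1")

print(plan_builtin)
print(plan_wrapped)
```

In the first shape, existing user code keeps calling the same method and only the session setup changes. In the second, distributed execution lives behind an extra type that users have to adopt explicitly.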

I'm not sure which approach aligns best with this project's philosophy. My naive intuition, as someone unfamiliar with this project, is that the first option has the best chance of providing a well-integrated experience. It's probably also the easiest to implement, because the internal Rust plumbing can be hidden inside this project.

I actually tried this here:

The fact that only ~1K LOC, examples and tests included, yields a functional integration makes me think it might actually not be a bad idea. But again, I don't know what I don't know, so I would gladly accept feedback and suggestions for a different approach.
