Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Allow running distributed queries in datafusion-python
Describe the solution you'd like
Ideally something well integrated with datafusion-python that does not require big changes or using different APIs for executing distributed queries.
datafusion-python is already a very ergonomic wrapper for using datafusion, so something maintaining that philosophy without introducing a lot of API surface would be ideal.
I'm interested specifically in using the datafusion-distributed library from within Python, and I see three mutually exclusive ways of integrating it:
- Make datafusion-python depend on datafusion-distributed, hiding some internal plumbing in datafusion-python and extending the current API with distributed capabilities.
- Create an external crate that depends on both datafusion-distributed and datafusion-python and ships an external API for using distributed functionality in datafusion-python.
- Make datafusion-distributed depend on datafusion-python, providing a set of functions and classes that decorate datafusion-python with distributed capabilities.
I'm not sure which approach aligns best with this project's philosophy. My naive intuition, as someone unfamiliar with this project, is that the first option has the best chance of providing a well-integrated experience. It's probably also the easiest to implement, because the internal plumbing on the Rust side can be hidden inside this project.
I actually tried this here:
The fact that only ~1K LOC, examples and tests included, is enough to yield a functional integration makes me think it might actually not be a bad idea. But again, I don't know what I don't know, so I would gladly accept feedback and suggestions for a different approach.