Skip to content

THRIFT-6069: python: add fastbinary decode_binary_from_bytes#3594

Open
markjm wants to merge 1 commit into
apache:masterfrom
markjm:apache-port-fastbinary-decode-from-bytes
Open

THRIFT-6069: python: add fastbinary decode_binary_from_bytes#3594
markjm wants to merge 1 commit into
apache:masterfrom
markjm:apache-port-fastbinary-decode-from-bytes

Conversation

@markjm

@markjm markjm commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Hi - I figured I'd share a few perf optimizations we are using internally. We are still on an older thrift version (😢 ), so I did use some agent magic to port these to the head of this repo. My testing was primarily on my branch, so buyer beware on that front!

This one is 1 of 3 PRs

⚠️ I did use AI tools to investigate and address feedback, but I am a real human ready to collaborate 😄


Add a bytes-based fastbinary decode entry point so callers that already hold serialized thrift data can bypass TMemoryBuffer and protocol wrapper setup while reusing the existing struct decoder.

Performance (50k iterations, warmed)

Workload decode_binary (transport) decode_binary_from_bytes Speedup
simple (30B) 3.59 us 0.81 us 4.43x
10-string (182B) 9.52 us 6.77 us 1.41x
complex (395B) 12.03 us 8.26 us 1.46x

The transport overhead (TMemoryBuffer + TBinaryProtocolAccelerated + BytesIO) is a fixed ~2.6us per call. For small structs this dominates; for larger structs the per-field decode work catches up.

@markjm markjm requested a review from mhlakhani as a code owner June 13, 2026 07:06
Copilot AI review requested due to automatic review settings June 13, 2026 07:06
@mergeable mergeable Bot added the python label Jun 13, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new fastbinary entrypoint to decode Thrift binary structs directly from a bytes object (bypassing transports), and validates behavior through Python tests.

Changes:

  • Add decode_binary_from_bytes(bytes, typeargs) to the Python C extension module.
  • Extend the C++ decode buffer to support direct reads from an in-memory bytes buffer.
  • Add Python tests (including accelerated protocol parity + error cases) using a simple Thrift-like struct.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
lib/py/test/thrift_TBinaryProtocol.py Adds tests and a minimal struct to validate decoding from raw bytes vs transport-based decoding.
lib/py/src/ext/types.h Extends DecodeBuffer to store and track a direct bytes source and cursor.
lib/py/src/ext/protocol.tcc Implements direct-buffer reads and adds a new prepareDecodeBufferFromBytes initializer.
lib/py/src/ext/protocol.h Declares prepareDecodeBufferFromBytes on the protocol base class.
lib/py/src/ext/module.cpp Exposes decode_binary_from_bytes to Python and wires it into the module method table.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/py/test/thrift_TBinaryProtocol.py Outdated
Comment thread lib/py/src/ext/protocol.tcc
Comment thread lib/py/src/ext/module.cpp
@markjm markjm force-pushed the apache-port-fastbinary-decode-from-bytes branch from 3af096a to 6af2e54 Compare June 13, 2026 07:11
@markjm markjm changed the title python: add fastbinary decode_binary_from_bytes [THRIFT-6069] python: add fastbinary decode_binary_from_bytes Jun 13, 2026
@Jens-G Jens-G changed the title [THRIFT-6069] python: add fastbinary decode_binary_from_bytes THRIFT-6069: python: add fastbinary decode_binary_from_bytes Jun 13, 2026
@Jens-G

Jens-G commented Jun 13, 2026

Copy link
Copy Markdown
Member

Code review

Found 2 issues:

  1. The second commit message contains "so extreme size_t values cannot wrap the position-plus-length check and bypass EOF enforcement." AGENTS.md §6 says to use neutral functional language for serialization bounds changes and never describe what an attacker can gain. A neutral rewrite would be: "Use an overflow-safe remaining-bytes comparison in the direct bytes decode path to prevent size_t wrapping."

6af2e54

  1. AI tool use is acknowledged in the PR body but neither commit contains the required Co-Authored-By: or Generated-by: label (AGENTS.md §4 says "Always label AI-assisted commits and PRs … Apply this label even when AI only generated a portion of the change").

thrift/AGENTS.md

Lines 57 to 71 in 35c1a53

## 4. AI-Generated Contributions
Per [`CONTRIBUTING.md § AI generated content`](CONTRIBUTING.md#ai-generated-content) and the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html):
- **Always** label AI-assisted commits and PRs. Use one or both of:
```
Co-Authored-By: <AI tool name and version>
Generated-by: <AI tool name and version>
```
Example:
```
THRIFT-9999: Fix connection timeout handling in Go client
Client: go
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@Jens-G

Jens-G commented Jun 18, 2026

Copy link
Copy Markdown
Member

Hi, any plans to continue here?

@markjm

markjm commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@Jens-G hi yes! Life took me away from code for a bit, but I'll get all the feedback addressed and PRs cleaned up for all the PRs I opened early/mid next week!

Apologies for the delay

Client: py

Add a bytes-based fastbinary decode entry point so callers that already
hold serialized thrift data can bypass TMemoryBuffer and protocol
wrapper setup while reusing the existing struct decoder. Use an
overflow-safe remaining-bytes comparison in the direct bytes decode
path so the position-plus-length check cannot wrap on extreme size_t
values.

Performance (50k iterations, warmed):

| Workload         | decode_binary (transport) | decode_binary_from_bytes | Speedup |
|------------------|---------------------------|--------------------------|---------|
| simple (30B)     | 3.59 us                   | 0.81 us                  | 4.43x   |
| 10-string (182B) | 9.52 us                   | 6.77 us                  | 1.41x   |
| complex (395B)   | 12.03 us                  | 8.26 us                  | 1.46x   |

The transport overhead (TMemoryBuffer + TBinaryProtocolAccelerated +
BytesIO) is a fixed ~2.6us per call. For small structs this dominates;
for larger structs the per-field decode work catches up.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 24, 2026 17:49
@markjm markjm force-pushed the apache-port-fastbinary-decode-from-bytes branch from 6af2e54 to 2ca8447 Compare June 24, 2026 17:49

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread lib/py/src/ext/module.cpp
Comment on lines +142 to +164
static PyObject* decode_binary_from_bytes(PyObject*, PyObject* args) {
PyObject* bytes_obj = nullptr;
PyObject* typeargs = nullptr;
if (!PyArg_ParseTuple(args, "OO", &bytes_obj, &typeargs)) {
return nullptr;
}
if (!PyBytes_Check(bytes_obj)) {
PyErr_SetString(PyExc_TypeError, "first argument must be bytes");
return nullptr;
}

StructTypeArgs parsedargs;
if (!parse_struct_args(&parsedargs, typeargs)) {
return nullptr;
}

BinaryProtocol protocol;
if (!protocol.prepareDecodeBufferFromBytes(bytes_obj)) {
return nullptr;
}

return protocol.readStruct(Py_None, parsedargs.klass, parsedargs.spec);
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants