[RNE Rewrite] Add text and image embeddings pipelines#1292

Open
msluszniak wants to merge 8 commits into rne-rewrite from @ms/add-embeddings

@msluszniak msluszniak commented Jun 30, 2026

Member

Description

Adds text and image embeddings pipelines to the new architecture, achieving parity with the old flow. Embeddings are pure-TypeScript tasks (pooling + L2-norm stay baked into the .pte): text tokenizes and runs forward; image reuses the existing image preprocessor. To run the existing int64-input embedding models unchanged, this adds an int64/Long tensor dtype to the core (the tensor data path is byte-oriented, so it is a small dtype.{h,cpp} + tensor.ts change).

Text inputs are fed at their exact token length (no padding). model.execute validates dynamically-shaped forward inputs against the [min, max, step] bounds exposed by an optional get_dynamic_dims method; models without it keep exact per-dimension validation. This fixes scale-sensitive pooling heads (e.g. DistilUSE's tanh projection), which padding otherwise corrupts.
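
As an illustration of the bounds check described above, here is a minimal TypeScript sketch. The `RangeDim` type, the `checkShape` helper, and the 512-token limit are hypothetical names and values chosen for this example, not the library's API. It also assumes that a size is valid when it lies in [min, max] and sits a whole number of steps above min; the actual step semantics may differ.

```typescript
// Hypothetical types: the names below are illustrative, not the library's API.
type RangeDim = { min: number; max: number; step: number };

// Returns null when `shape` fits `bounds`, otherwise a human-readable reason.
// Assumption: a size is valid when min <= size <= max and it sits a whole
// number of steps above min (step is assumed to be >= 1).
function checkShape(
  shape: readonly number[],
  bounds: readonly RangeDim[]
): string | null {
  if (shape.length !== bounds.length) {
    return `rank mismatch: got ${shape.length}, expected ${bounds.length}`;
  }
  for (let d = 0; d < shape.length; d++) {
    const size = shape[d];
    const { min, max, step } = bounds[d];
    if (!Number.isInteger(size) || size < min || size > max) {
      return `dim ${d}: ${size} is outside [${min}, ${max}]`;
    }
    if ((size - min) % step !== 0) {
      return `dim ${d}: ${size} is not min + k * ${step}`;
    }
  }
  return null;
}

// A [1, seqLen] token tensor whose sequence length may vary (limit is made up).
const tokenBounds: RangeDim[] = [
  { min: 1, max: 1, step: 1 },
  { min: 1, max: 512, step: 1 },
];
console.log(checkShape([1, 7], tokenBounds)); // null: exact token length, no padding
console.log(checkShape([1, 600], tokenBounds)); // reports dim 1 as out of range
```

With bounds like these, a 7-token input is accepted at shape [1, 7] as-is, so the pooling head never sees padded positions.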

Includes createTextEmbeddings / createImageEmbeddings tasks, useTextEmbeddings / useImageEmbeddings hooks, models.textEmbeddings / models.imageEmbeddings registry entries, an interactive text-embeddings demo in apps/nlp, and a CLIP zero-shot image-embeddings demo in apps/computer-vision.

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

  • nlp app → Text Embeddings: the screen seeds a sentence library; type a query and use "Find similar" to rank the sentences by cosine similarity, and switch models via the chips. Verified on a physical Android device (arm64): all-MiniLM-L6-v2 returns 384-dim L2-normalized embeddings (~25 ms/forward on XNNPACK); DistilUSE ranks correctly with a wide similarity spread (previously compressed by padding).
  • computer-vision app → Image Embeddings: pick an image and rank editable text labels via CLIP zero-shot (image vs. text embeddings). Verified on device.
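
Both demos come down to ranking candidates by cosine similarity against a query embedding. The sketch below shows that ranking step in TypeScript on toy 3-dimensional vectors. `cosineSimilarity`, `rankBySimilarity`, and the `Candidate` type are hypothetical helpers written for this example, not code from the demo apps.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical candidate record: a label plus its precomputed embedding.
type Candidate = { label: string; embedding: Float32Array };

// Scores every candidate against the query and sorts best-first.
function rankBySimilarity(query: Float32Array, candidates: readonly Candidate[]) {
  return candidates
    .map(({ label, embedding }) => ({
      label,
      score: cosineSimilarity(query, embedding),
    }))
    .sort((x, y) => y.score - x.score);
}

// Toy vectors standing in for real embeddings.
const ranked = rankBySimilarity(Float32Array.of(1, 0.1, 0), [
  { label: 'dog', embedding: Float32Array.of(0, 1, 0) },
  { label: 'cat', embedding: Float32Array.of(1, 0, 0) },
  { label: 'car', embedding: Float32Array.of(-1, 0, 0) },
]);
console.log(ranked.map((r) => r.label)); // best match first
```

For CLIP zero-shot, the query would be the image embedding and the candidates the text-label embeddings; for the text demo, both sides are sentence embeddings.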

Screenshots

Related issues

#1247

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

DistilUSE and CLIP (text) are re-exported with the get_dynamic_dims method and pinned to v0.10.0; the remaining text-embedding models (all-MiniLM-L6-v2, all-mpnet-base-v2, multi-qa MiniLM/MPNet, paraphrase-ML) still need re-export to v0.10.0.

@msluszniak msluszniak self-assigned this Jun 30, 2026
@msluszniak msluszniak linked an issue Jun 30, 2026 that may be closed by this pull request
@msluszniak msluszniak added the feature PRs that implement a new feature label Jun 30, 2026
Comment thread apps/nlp/app/text-embeddings/index.tsx Outdated
Comment thread apps/nlp/app/text-embeddings/index.tsx Outdated
Comment thread apps/nlp/app/text-embeddings/index.tsx Outdated
Comment thread packages/react-native-executorch/cpp/core/model.cpp Outdated
Comment thread packages/react-native-executorch/cpp/core/model.cpp Outdated
Comment thread apps/nlp/app/text-embeddings/index.tsx
@msluszniak msluszniak marked this pull request as ready for review July 1, 2026 16:07
@msluszniak msluszniak requested a review from barhanc July 1, 2026 16:07
Add int64/Long tensor dtype support and text/image embeddings tasks,
hooks, and model registry entries, plus an interactive text-embeddings
demo screen in apps/nlp.

Closes #1247
model.execute now validates dynamically-shaped forward inputs against the
model-declared [min, max, step] bounds exposed by an optional
get_dynamic_dims method, instead of requiring an exact shape match; models
without it keep exact per-dimension validation. Text embeddings feed the
exact token length with no padding, which fixes scale-sensitive pooling
heads (e.g. DistilUSE's tanh projection).

Point DistilUSE at v0.10.0 (re-exported with get_dynamic_dims).
…mbeddings demo

- Simplify text-embeddings cosine to a dot product (all models L2-normalize)
  and drop redundant inline comments.
- Move the get_dynamic_dims / input-validation contract into the
  ModelHostObject class docs; trim the inline narration in model.cpp.
- Add an Image Embeddings example to the computer-vision app: pick two images
  and compare their CLIP embeddings by cosine similarity.
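
The simplification in the first bullet rests on the identity cos(a, b) = a·b / (|a| |b|): when both vectors have unit length, the denominator is 1 and the cosine is just the dot product. A small TypeScript sketch of that identity follows; `l2Normalize` and `dot` are illustrative helpers, not the demo's code.

```typescript
// Plain dot product of two equal-length vectors.
function dot(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Scales a vector to unit length (returns a copy; zero vectors stay zero).
function l2Normalize(v: Float32Array): Float32Array {
  const norm = Math.sqrt(dot(v, v));
  return norm === 0 ? Float32Array.from(v) : v.map((x) => x / norm);
}

const a = l2Normalize(Float32Array.of(3, 4)); // roughly [0.6, 0.8]
const b = l2Normalize(Float32Array.of(4, 3)); // roughly [0.8, 0.6]

// Both norms are 1 (up to float32 rounding), so the dot product is the cosine.
const similarity = dot(a, b);
console.log(similarity.toFixed(2)); // about 0.96
```

This only holds when every model L2-normalizes its output, which is the condition the commit message states.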
Rework the computer-vision Image Embeddings screen (based on main's CLIP demo):
pick an image and rank editable text labels by CLIP image/text embedding
similarity, instead of the uninformative two-image score. Pads the scroll
content past the Android nav bar.

Point CLIP text + image at v0.10.0 (text re-exported with get_dynamic_dims;
image unchanged) and declare the textEmbeddings feature in the app.
- model.{h,cpp}: read get_dynamic_dims once per model and cache it instead
  of re-executing the method on every forward() call; reject a present-but-
  malformed declaration (wrong dtype/rank/shape, bad min/max/step, or row
  count not matching forward's tensor input dims) with an explicit error
  instead of silently falling back to exact validation.
- textEmbeddings: throw a clear error when input tokenizes to zero tokens
  (was BigInt(undefined)); fix docstring to match no-padding behavior.
- useTextEmbeddings: expose localPath/tokenizerPath like sibling hooks.
- computer-vision: extract shared skImageToBuffer helper, dedup from
  classification and imageEmbeddings screens.
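
To make the "reject a malformed declaration" behaviour from the first bullet concrete, here is a hedged TypeScript sketch of such a check. `parseDynamicDims` and `BoundsRow` are hypothetical names, and the specific sanity rules (non-negative min, max not below min, non-negative step) are assumptions for illustration; the C++ implementation may apply different ones.

```typescript
type BoundsRow = readonly [number, number, number]; // [min, max, step]

// Hypothetical helper: checks a raw get_dynamic_dims table before it is trusted.
// `expectedRows` is the total number of dimensions across forward's tensor inputs.
function parseDynamicDims(
  rows: readonly (readonly number[])[],
  expectedRows: number
): BoundsRow[] {
  if (rows.length !== expectedRows) {
    throw new Error(
      `get_dynamic_dims: declared ${rows.length} rows, forward needs ${expectedRows}`
    );
  }
  return rows.map((row, i): BoundsRow => {
    if (row.length !== 3 || !row.every((v) => Number.isInteger(v))) {
      throw new Error(`get_dynamic_dims: row ${i} must hold three integers`);
    }
    const [min, max, step] = row;
    // Assumed sanity rules; the real checks may differ.
    if (min < 0 || max < min || step < 0) {
      throw new Error(
        `get_dynamic_dims: row ${i} has invalid bounds [${min}, ${max}, ${step}]`
      );
    }
    return [min, max, step];
  });
}

// Two rows for a single [1, seqLen] input (the 512 limit is made up).
const bounds = parseDynamicDims([[1, 1, 1], [1, 512, 1]], 2);
console.log(bounds.length); // 2
```

Throwing here, rather than falling back to exact validation, surfaces a broken export at load time instead of as a confusing shape error later.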
@msluszniak msluszniak marked this pull request as draft July 3, 2026 14:24
@msluszniak msluszniak force-pushed the @ms/add-embeddings branch from 8e0f200 to ede040e Compare July 3, 2026 14:25
Comment thread packages/react-native-executorch/cpp/core/model.cpp Outdated
Rebase onto rne-rewrite adopted #1296's rewritten model.cpp, which delegates
tensor dtype/shape checks to tensor::fromJs and already supports RangeDim
[min, max, step] bounds. Re-implement variable-length forward inputs on top of
it: parse get_dynamic_dims once per method into cached bounds, build a
SymbolicShape of RangeDims, and pass it to fromJs. Statically shaped methods
keep exact validation.
@msluszniak msluszniak force-pushed the @ms/add-embeddings branch from ede040e to 6c3ccc4 Compare July 3, 2026 14:37
@msluszniak msluszniak marked this pull request as ready for review July 3, 2026 14:37
Align the text/image embeddings tasks with the add-task-pipeline skill (and
every other task): allocate the static output tensor in a `[...] as const`
array, destructure it, and dispose via `tensors.forEach`.
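
The allocation pattern named in this commit can be sketched as follows. `FakeTensor` and `createEmbedder` are hypothetical stand-ins written for this example (the real tasks use the library's native tensor handles); only the `[...] as const`, destructuring, and `tensors.forEach` shape is taken from the commit message.

```typescript
// Hypothetical stand-in for a native tensor handle; not the library's class.
class FakeTensor {
  readonly dtype: 'float32' | 'int64';
  readonly shape: readonly number[];
  disposed = false;

  constructor(dtype: 'float32' | 'int64', shape: readonly number[]) {
    this.dtype = dtype;
    this.shape = shape;
  }

  dispose(): void {
    this.disposed = true;
  }
}

function createEmbedder(dim: number) {
  // `as const` makes this a tuple, so destructuring keeps each element's type
  // and a missing element is a compile-time error.
  const tensors = [new FakeTensor('float32', [1, dim])] as const;
  const [tOutput] = tensors;

  return {
    outputShape: () => tOutput.shape,
    // One place releases every statically allocated tensor.
    dispose: () => tensors.forEach((t) => t.dispose()),
    tensors,
  };
}

const embedder = createEmbedder(384);
console.log(embedder.outputShape()); // [1, 384]
embedder.dispose();
```

Keeping all static tensors in one tuple means adding a tensor later automatically includes it in disposal.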

@barhanc barhanc left a comment

Contributor

Checked only lib implementation. I will take a look at app code on Monday.


Regarding core/ dynamic input support changes, I've added some suggestions that imo make the code more future-proof and easier to change.


Regarding the TS side, everything is fine, just some small nits. I was wondering, though, whether we should take the opportunity of this refactor to beef up the TS embeddings side a bit more: for example, unify image and text embeddings into one pipeline that, on top of exposing a simple embed method, would also expose methods implementing a small vector search data structure, like insert, clear, find, etc. I don't know how much effort that would be, so it's your call. The current implementation is correct.
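
For reference, a small vector search structure of the kind floated in this comment could look roughly like the sketch below. Everything here is hypothetical: `VectorIndex`, `Hit`, and the brute-force dot-product scoring are one possible shape, and the sketch assumes all stored and query vectors are already L2-normalized.

```typescript
type Hit<T> = { item: T; score: number };

// Hypothetical in-memory index; brute-force search over L2-normalized vectors.
class VectorIndex<T> {
  private entries: { item: T; vector: Float32Array }[] = [];
  private readonly dim: number;

  constructor(dim: number) {
    this.dim = dim;
  }

  get size(): number {
    return this.entries.length;
  }

  insert(item: T, vector: Float32Array): void {
    if (vector.length !== this.dim) {
      throw new Error(`expected ${this.dim} dimensions, got ${vector.length}`);
    }
    // Copy so later mutation of the caller's buffer cannot corrupt the index.
    this.entries.push({ item, vector: Float32Array.from(vector) });
  }

  clear(): void {
    this.entries = [];
  }

  // Returns the k best matches, highest dot product first.
  find(query: Float32Array, k: number): Hit<T>[] {
    if (query.length !== this.dim) {
      throw new Error(`expected ${this.dim} dimensions, got ${query.length}`);
    }
    return this.entries
      .map(({ item, vector }) => {
        let score = 0;
        for (let i = 0; i < this.dim; i++) {
          score += query[i] * vector[i];
        }
        return { item, score };
      })
      .sort((a, b) => b.score - a.score)
      .slice(0, Math.max(0, k));
  }
}

const index = new VectorIndex<string>(2);
index.insert('north', Float32Array.of(0, 1));
index.insert('east', Float32Array.of(1, 0));
const hits = index.find(Float32Array.of(0.6, 0.8), 1);
console.log(hits[0].item); // the closest stored item
```

A linear scan like this is only reasonable for small libraries; whether the pipeline should own such a structure is the design question the comment leaves open.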

Comment on lines +281 to +310

std::shared_ptr<TensorHostObject> tensorHostObject;
if (dynamicInputBounds.empty()) {
  tensorHostObject = tensor::fromJs(rt, ctx, val, expectedDtype, tensorMeta.sizes());
} else {
  // Map bounds by the method-declared rank so mapping is
  // independent of the caller-supplied shape.
  const auto rank = tensorMeta.sizes().size();
  if (boundsOffset + rank > dynamicInputBounds.size()) {
    throw jsi::JSError(rt, std::format("execute: get_dynamic_dims declares fewer "
                                       "dimensions ({}) than forward's tensor "
                                       "inputs require",
                                       dynamicInputBounds.size()));
  }
  tensor::SymbolicShape expectedShape;
  expectedShape.reserve(rank);
  for (size_t d = 0; d < rank; ++d) {
    const auto &row = dynamicInputBounds[boundsOffset + d];
    tensor::RangeDim rangeDim;
    rangeDim.min = static_cast<int32_t>(row[0]);
    rangeDim.max = static_cast<int32_t>(row[1]);
    if (row[2] > 1) {
      rangeDim.step = static_cast<int32_t>(row[2]);
    }
    expectedShape.emplace_back(rangeDim);
  }
  boundsOffset += rank;
  tensorHostObject = tensor::fromJs(rt, ctx, val, expectedDtype,
                                    std::optional<tensor::SymbolicShape>(std::move(expectedShape)));
}

Suggested change
- std::shared_ptr<TensorHostObject> tensorHostObject;
- if (dynamicInputBounds.empty()) {
-   tensorHostObject = tensor::fromJs(rt, ctx, val, expectedDtype, tensorMeta.sizes());
- } else {
-   // Map bounds by the method-declared rank so mapping is
-   // independent of the caller-supplied shape.
-   const auto rank = tensorMeta.sizes().size();
-   if (boundsOffset + rank > dynamicInputBounds.size()) {
-     throw jsi::JSError(rt, std::format("execute: get_dynamic_dims declares fewer "
-                                        "dimensions ({}) than forward's tensor "
-                                        "inputs require",
-                                        dynamicInputBounds.size()));
-   }
-   tensor::SymbolicShape expectedShape;
-   expectedShape.reserve(rank);
-   for (size_t d = 0; d < rank; ++d) {
-     const auto &row = dynamicInputBounds[boundsOffset + d];
-     tensor::RangeDim rangeDim;
-     rangeDim.min = static_cast<int32_t>(row[0]);
-     rangeDim.max = static_cast<int32_t>(row[1]);
-     if (row[2] > 1) {
-       rangeDim.step = static_cast<int32_t>(row[2]);
-     }
-     expectedShape.emplace_back(rangeDim);
-   }
-   boundsOffset += rank;
-   tensorHostObject = tensor::fromJs(rt, ctx, val, expectedDtype,
-                                     std::optional<tensor::SymbolicShape>(std::move(expectedShape)));
- }
+ std::shared_ptr<TensorHostObject> tensorHostObject;
+ if (self->dynamicInputShapes_.contains(methodName)) {
+   auto expectedShape = self->dynamicInputShapes_[methodName][i];
+   tensorHostObject = tensor::fromJs(rt, ctx, val, expectedDtype, expectedShape);
+ } else {
+   tensorHostObject = tensor::fromJs(rt, ctx, val, expectedDtype, tensorMeta.sizes());
+ }

We want to minimize the changes to the execute method required for dynamic input validation so that the code is future-proof, e.g. when ExecuTorch adds native support or when we want to change how the dynamic shapes are parsed.

Comment on lines +338 to +343
if (!dynamicInputBounds.empty() && boundsOffset != dynamicInputBounds.size()) {
  throw jsi::JSError(rt, std::format("execute: get_dynamic_dims declares more dimensions ({}) "
                                     "than forward's tensor inputs use ({})",
                                     dynamicInputBounds.size(), boundsOffset));
}

Suggested change
- if (!dynamicInputBounds.empty() && boundsOffset != dynamicInputBounds.size()) {
-   throw jsi::JSError(rt, std::format("execute: get_dynamic_dims declares more dimensions ({}) "
-                                      "than forward's tensor inputs use ({})",
-                                      dynamicInputBounds.size(), boundsOffset));
- }

Comment on lines +263 to +271
// Per-dimension [min, max, step] bounds parsed from get_dynamic_dims
// at construction. Absent for statically shaped methods, which then
// validate exactly.
const std::vector<std::array<int64_t, 3>> noBounds;
auto boundsIt = self->dynamicInputBounds_.find(methodName);
const auto &dynamicInputBounds =
boundsIt != self->dynamicInputBounds_.end() ? boundsIt->second : noBounds;
size_t boundsOffset = 0;

Suggested change
- // Per-dimension [min, max, step] bounds parsed from get_dynamic_dims
- // at construction. Absent for statically shaped methods, which then
- // validate exactly.
- const std::vector<std::array<int64_t, 3>> noBounds;
- auto boundsIt = self->dynamicInputBounds_.find(methodName);
- const auto &dynamicInputBounds =
-     boundsIt != self->dynamicInputBounds_.end() ? boundsIt->second : noBounds;
- size_t boundsOffset = 0;

  * Writes data from a typed array into this tensor's native buffer.
  * @param src The source typed array. Its size in bytes must match the
- * tensor's size.
+ * tensor's size. Use a `BigInt64Array` for `int64` tensors.

Suggested change
- * tensor's size. Use a `BigInt64Array` for `int64` tensors.
+ * tensor's size.
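
The doc comment under discussion requires the source's size in bytes to match the tensor's size. A minimal TypeScript sketch of that byte-length contract follows. `BYTES_PER_ELEMENT`, `expectedByteLength`, and `assertFits` are hypothetical names for this example, and the dtype list is an assumed subset rather than the library's actual set.

```typescript
// Assumed subset of dtypes and their element sizes in bytes.
const BYTES_PER_ELEMENT = { float32: 4, int32: 4, int64: 8 } as const;
type DType = keyof typeof BYTES_PER_ELEMENT;

// Total bytes a tensor of this dtype and shape occupies.
function expectedByteLength(dtype: DType, shape: readonly number[]): number {
  const elements = shape.reduce((count, dim) => count * dim, 1);
  return elements * BYTES_PER_ELEMENT[dtype];
}

// Throws when the typed array's byte size does not match the tensor's.
function assertFits(dtype: DType, shape: readonly number[], src: ArrayBufferView): void {
  const expected = expectedByteLength(dtype, shape);
  if (src.byteLength !== expected) {
    throw new Error(`expected ${expected} bytes for ${dtype}, got ${src.byteLength}`);
  }
}

// int64 data is carried in a BigInt64Array: 3 elements * 8 bytes = 24 bytes.
const ids = BigInt64Array.from([101, 2023, 102].map((id) => BigInt(id)));
assertFits('int64', [1, 3], ids);
console.log(ids.byteLength); // 24
```

An `Int32Array` with the same three values would be 12 bytes and fail the check, which is why an 8-byte element type is needed for `int64` tensors on a byte-oriented data path.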

std::unique_ptr<executorch::extension::Module> etModule_;
std::mutex mutex_;

std::unordered_map<std::string, std::vector<std::array<int64_t, 3>>> dynamicInputBounds_;

Suggested change
- std::unordered_map<std::string, std::vector<std::array<int64_t, 3>>> dynamicInputBounds_;
+ std::unordered_map<std::string, std::vector<tensor::SymbolicShape>> dynamicInputShapes_;

Let's use the tensor::SymbolicShape directly so that we don't have to build it on every execute call and can simplify the execute code.
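
The idea of building the shapes once and looking them up per input can be sketched in TypeScript (kept in TypeScript for illustration; the suggestion itself targets the C++ core). `toSymbolicShapes`, the `RangeDim` and `SymbolicShape` types, and the `'forward'` entry are hypothetical, and treating a step of 0 or 1 as "no step" is an assumption based on the `row[2] > 1` check in the original code.

```typescript
type RangeDim = { min: number; max: number; step: number };
type SymbolicShape = RangeDim[];

// Splits the flat [min, max, step] rows into one shape per tensor input,
// using each input's declared rank. Runs once, at construction time.
function toSymbolicShapes(
  rows: readonly (readonly [number, number, number])[],
  inputRanks: readonly number[]
): SymbolicShape[] {
  const shapes: SymbolicShape[] = [];
  let offset = 0;
  for (const rank of inputRanks) {
    if (offset + rank > rows.length) {
      throw new Error('get_dynamic_dims declares fewer dimensions than the inputs require');
    }
    shapes.push(
      rows
        .slice(offset, offset + rank)
        // Assumption: step values of 0 or 1 both mean "any size in range".
        .map(([min, max, step]) => ({ min, max, step: Math.max(step, 1) }))
    );
    offset += rank;
  }
  if (offset !== rows.length) {
    throw new Error('get_dynamic_dims declares more dimensions than the inputs use');
  }
  return shapes;
}

// Built once per method and cached; execute only does a lookup.
const dynamicInputShapes = new Map<string, SymbolicShape[]>();
dynamicInputShapes.set(
  'forward',
  toSymbolicShapes([[1, 1, 1], [1, 512, 1], [1, 1, 1], [1, 512, 1]], [2, 2])
);

const cached = dynamicInputShapes.get('forward');
const expectedShapeForInput1 = cached ? cached[1] : undefined;
console.log(expectedShapeForInput1);
```

With the shapes cached this way, the per-call path reduces to the short lookup in the suggested change, and both row-count checks move to construction time.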

* @param input The input text to embed.
* @returns A promise resolving to the embedding vector.
*/
forward: (input: string) => Promise<Float32Array>;

Same as in image embeddings: a more descriptive name would be better imo.

Comment on lines +105 to +106
const tokenIds = tensor('int64', [1, len], idsData);
const attentionMask = tensor('int64', [1, len], maskData);

Please use the t<Name> naming convention for tensor variables.

* @returns A promise resolving to an object containing the embedding and
* disposal controls.
*/
export async function createTextEmbeddings(

Probably should be named ...Embedder to match other tasks. Same with file name, perhaps ...Embedding.ts (no 's') would be more consistent.

if (ids.length === 0) {
throw new Error('createTextEmbeddings: input tokenized to zero tokens');
}
const len = Math.min(ids.length, maxSeqLen);

Worth documenting the truncating behaviour on long inputs.
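
To show what the truncation means in practice, here is a standalone TypeScript sketch of the input-building step quoted above. `buildTextInputs` is a hypothetical helper, not the task's code; it mirrors the zero-token guard and the `Math.min(ids.length, maxSeqLen)` cut, and adds a `truncated` flag purely for illustration.

```typescript
// Hypothetical helper mirroring the quoted logic: guard, truncate, then build
// int64 buffers for the token ids and an all-ones attention mask.
function buildTextInputs(ids: readonly number[], maxSeqLen: number) {
  if (ids.length === 0) {
    throw new Error('input tokenized to zero tokens');
  }
  // Tokens past maxSeqLen are silently dropped, so long inputs lose their tail.
  const len = Math.min(ids.length, maxSeqLen);
  const idsData = new BigInt64Array(len);
  const maskData = new BigInt64Array(len);
  for (let i = 0; i < len; i++) {
    idsData[i] = BigInt(ids[i]);
    maskData[i] = BigInt(1);
  }
  return {
    shape: [1, len] as const,
    idsData,
    maskData,
    truncated: ids.length > len, // illustrative flag, not part of the real API
  };
}

const short = buildTextInputs([101, 7592, 102], 8);
console.log(short.shape, short.truncated); // shape [1, 3], not truncated

const long = buildTextInputs([101, 1, 2, 3, 4, 5, 102], 4);
console.log(long.shape, long.truncated); // shape [1, 4], truncated
```

Note that in the second call the final token is cut along with the rest of the tail, which is the kind of detail the docstring could spell out.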

Comment on lines 2 to +3
export * from './tasks/tokenization';
export * from './tasks/textEmbeddings';

Tasks shouldn't be explicitly exported from /extensions/<domain>/index.ts.

Labels

feature (PRs that implement a new feature), refactoring

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RNE Rewrite] Add image and text embeddings pipelines

2 participants