[RNE Rewrite] Add text and image embeddings pipelines #1292
Add int64/Long tensor dtype support and text/image embeddings tasks, hooks, and model registry entries, plus an interactive text-embeddings demo screen in apps/nlp. Closes #1247
model.execute now validates dynamically-shaped forward inputs against the model-declared [min, max, step] bounds exposed by an optional get_dynamic_dims method, instead of requiring an exact shape match; models without it keep exact per-dimension validation. Text embeddings feed the exact token length with no padding, which fixes scale-sensitive pooling heads (e.g. DistilUSE's tanh projection). Point DistilUSE at v0.10.0 (re-exported with get_dynamic_dims).
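The bounds check described above can be sketched roughly as follows. This is an illustration only: the real validation lives in the C++ core, `DimBounds` and `checkDynamicShape` are hypothetical names, the 512 upper bound is made up, and the interpretation of `step` (sizes must be `min + k * step`) is an assumption.

```typescript
// Hypothetical type: one [min, max, step] row per tensor dimension.
type DimBounds = readonly [number, number, number];

/**
 * Checks one input shape against per-dimension [min, max, step] bounds.
 * Assumption: a size is valid when it lies in [min, max] and is reachable
 * from `min` in increments of `step` (a step <= 1 allows every size).
 * Returns null when the shape is valid, otherwise an error message.
 */
function checkDynamicShape(
  shape: readonly number[],
  bounds: readonly DimBounds[],
): string | null {
  if (shape.length !== bounds.length) {
    return `rank mismatch: got ${shape.length}, expected ${bounds.length}`;
  }
  for (let d = 0; d < shape.length; d++) {
    const size = shape[d];
    const [min, max, step] = bounds[d];
    if (!Number.isInteger(size) || size < min || size > max) {
      return `dim ${d}: ${size} is outside [${min}, ${max}]`;
    }
    if (step > 1 && (size - min) % step !== 0) {
      return `dim ${d}: ${size} is not min + k * step (min=${min}, step=${step})`;
    }
  }
  return null;
}

// A [1, seqLen] token tensor whose second dimension may vary (bounds are made up).
const tokenBounds: DimBounds[] = [
  [1, 1, 1],
  [1, 512, 1],
];
const okResult = checkDynamicShape([1, 37], tokenBounds); // valid: exact token length
const badResult = checkDynamicShape([1, 600], tokenBounds); // invalid: exceeds max
```

Returning a message instead of throwing keeps the sketch easy to test; the core reports such failures as JS errors.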
…mbeddings demo
- Simplify text-embeddings cosine to a dot product (all models L2-normalize) and drop redundant inline comments.
- Move the get_dynamic_dims / input-validation contract into the ModelHostObject class docs; trim the inline narration in model.cpp.
- Add an Image Embeddings example to the computer-vision app: pick two images and compare their CLIP embeddings by cosine similarity.
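Why the dot product suffices: cosine similarity is `a·b / (|a| |b|)`, and for L2-normalized vectors both norms are 1, so it reduces to `a·b`. A minimal sketch with hypothetical helper names (the demo's actual helpers are not shown in this PR page):

```typescript
/** Dot product of two equal-length vectors. */
function dot(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error(`dot: length mismatch (${a.length} vs ${b.length})`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

/** Scales a vector to unit L2 norm; a zero vector is returned unchanged. */
function l2Normalize(v: Float32Array): Float32Array {
  const norm = Math.sqrt(dot(v, v));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

/** General cosine similarity, shown only for comparison. */
function cosine(a: Float32Array, b: Float32Array): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const unitA = l2Normalize(new Float32Array([3, 4])); // [0.6, 0.8]
const unitB = l2Normalize(new Float32Array([4, 3])); // [0.8, 0.6]
const viaDot = dot(unitA, unitB);
const viaCosine = cosine(unitA, unitB);
// Both values agree up to float32 rounding (about 0.96 here).
```

The shortcut is only valid while every model keeps L2 normalization baked into its output; for unnormalized vectors the full formula is needed.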
Rework the computer-vision Image Embeddings screen (based on main's CLIP demo): pick an image and rank editable text labels by CLIP image/text embedding similarity, instead of the uninformative two-image score. Pads the scroll content past the Android nav bar. Point CLIP text + image at v0.10.0 (text re-exported with get_dynamic_dims; image unchanged) and declare the textEmbeddings feature in the app.
- model.{h,cpp}: read get_dynamic_dims once per model and cache it instead
of re-executing the method on every forward() call; reject a present-but-
malformed declaration (wrong dtype/rank/shape, bad min/max/step, or row
count not matching forward's tensor input dims) with an explicit error
instead of silently falling back to exact validation.
- textEmbeddings: throw a clear error when input tokenizes to zero tokens
(was BigInt(undefined)); fix docstring to match no-padding behavior.
- useTextEmbeddings: expose localPath/tokenizerPath like sibling hooks.
- computer-vision: extract shared skImageToBuffer helper, dedup from
classification and imageEmbeddings screens.
8e0f200 to ede040e
Rebase onto rne-rewrite adopted #1296's rewritten model.cpp, which delegates tensor dtype/shape checks to tensor::fromJs and already supports RangeDim [min, max, step] bounds. Re-implement variable-length forward inputs on top of it: parse get_dynamic_dims once per method into cached bounds, build a SymbolicShape of RangeDims, and pass it to fromJs. Statically shaped methods keep exact validation.
ede040e to 6c3ccc4
Align the text/image embeddings tasks with the add-task-pipeline skill (and every other task): allocate the static output tensor in a `[...] as const` array, destructure it, and dispose via `tensors.forEach`.
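The allocation pattern named in this commit might look roughly like the sketch below. `DisposableTensor`, `makeTensor`, and `createEmbedder` are hypothetical stand-ins, not the library's API; only the `as const` array, the destructuring, and the `tensors.forEach` disposal mirror the commit message.

```typescript
// Hypothetical stand-in for a native tensor handle; only dispose() matters here.
interface DisposableTensor {
  readonly shape: readonly number[];
  dispose(): void;
}

let liveTensors = 0; // counts allocations that have not been disposed yet

function makeTensor(shape: readonly number[]): DisposableTensor {
  liveTensors++;
  let disposed = false;
  return {
    shape,
    dispose() {
      if (!disposed) {
        disposed = true;
        liveTensors--;
      }
    },
  };
}

function createEmbedder(embeddingDim: number) {
  // `as const` makes this a fixed-length readonly tuple, so destructuring
  // yields a precisely typed element and the same array drives disposal.
  const tensors = [makeTensor([1, embeddingDim])] as const;
  const [tOutput] = tensors;

  return {
    outputShape: tOutput.shape,
    dispose: () => tensors.forEach((t) => t.dispose()),
  };
}

const embedder = createEmbedder(384);
const liveBeforeDispose = liveTensors;
embedder.dispose();
const liveAfterDispose = liveTensors;
```

Keeping every static tensor in one array means a tensor added later cannot be forgotten at disposal time.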
barhanc left a comment
I have only checked the lib implementation. I will take a look at the app code on Monday.
Regarding the core/ dynamic input support changes, I've added some suggestions that imo make the code more future-proof and easier to change.
The TS side is fine, just some small nits. I was wondering, though, whether we want to take the opportunity of this refactor to beef up the TS embeddings side a bit more: for example, unifying image and text embeddings into one pipeline that, on top of exposing a simple embed method, would also expose methods implementing a small vector search data structure, such as insert, clear, find, etc. I don't know how much effort that would be, so it's your call. The current implementation is correct.
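To gauge the effort, the vector search structure mentioned above could start as a brute-force in-memory index. The sketch below is hypothetical (class and method shapes are assumptions, not an agreed design) and assumes all stored vectors are L2-normalized so that the dot product equals cosine similarity.

```typescript
interface SearchHit<T> {
  item: T;
  score: number;
}

function dotProduct(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

/** Brute-force index over L2-normalized embeddings (hypothetical design). */
class EmbeddingIndex<T> {
  private entries: { item: T; vector: Float32Array }[] = [];
  private readonly dim: number;

  constructor(dim: number) {
    this.dim = dim;
  }

  get size(): number {
    return this.entries.length;
  }

  insert(item: T, vector: Float32Array): void {
    this.assertDim(vector);
    // Copy so later writes to the caller's buffer cannot change stored data.
    this.entries.push({ item, vector: vector.slice() });
  }

  clear(): void {
    this.entries = [];
  }

  /** Returns the k best matches, highest similarity first. */
  find(query: Float32Array, k: number): SearchHit<T>[] {
    this.assertDim(query);
    return this.entries
      .map(({ item, vector }) => ({ item, score: dotProduct(query, vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, Math.max(0, k));
  }

  private assertDim(v: Float32Array): void {
    if (v.length !== this.dim) {
      throw new Error(`expected ${this.dim} dimensions, got ${v.length}`);
    }
  }
}

const index = new EmbeddingIndex<string>(2);
index.insert("east", new Float32Array([1, 0]));
index.insert("north", new Float32Array([0, 1]));
index.insert("north-east", new Float32Array([Math.SQRT1_2, Math.SQRT1_2]));
const hits = index.find(new Float32Array([1, 0]), 2);
```

A linear scan costs O(n · dim) per query, which is likely adequate for demo-sized libraries; anything larger would call for an approximate index.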

    std::shared_ptr<TensorHostObject> tensorHostObject;
    if (dynamicInputBounds.empty()) {
      tensorHostObject = tensor::fromJs(rt, ctx, val, expectedDtype, tensorMeta.sizes());
    } else {
      // Map bounds by the method-declared rank so mapping is
      // independent of the caller-supplied shape.
      const auto rank = tensorMeta.sizes().size();
      if (boundsOffset + rank > dynamicInputBounds.size()) {
        throw jsi::JSError(rt, std::format("execute: get_dynamic_dims declares fewer "
                                           "dimensions ({}) than forward's tensor "
                                           "inputs require",
                                           dynamicInputBounds.size()));
      }
      tensor::SymbolicShape expectedShape;
      expectedShape.reserve(rank);
      for (size_t d = 0; d < rank; ++d) {
        const auto &row = dynamicInputBounds[boundsOffset + d];
        tensor::RangeDim rangeDim;
        rangeDim.min = static_cast<int32_t>(row[0]);
        rangeDim.max = static_cast<int32_t>(row[1]);
        if (row[2] > 1) {
          rangeDim.step = static_cast<int32_t>(row[2]);
        }
        expectedShape.emplace_back(rangeDim);
      }
      boundsOffset += rank;
      tensorHostObject = tensor::fromJs(rt, ctx, val, expectedDtype,
                                        std::optional<tensor::SymbolicShape>(std::move(expectedShape)));
    }

Suggested change:

    std::shared_ptr<TensorHostObject> tensorHostObject;
    if (self->dynamicInputShapes_.contains(methodName)) {
      auto expectedShape = self->dynamicInputShapes_[methodName][i];
      tensorHostObject = tensor::fromJs(rt, ctx, val, expectedDtype, expectedShape);
    } else {
      tensorHostObject = tensor::fromJs(rt, ctx, val, expectedDtype, tensorMeta.sizes());
    }

We want to minimize the changes to the execute method required for dynamic-input validation so that the code is future-proof, e.g. when ExecuTorch adds native support or when we want to change how the dynamic shapes are parsed.

    if (!dynamicInputBounds.empty() && boundsOffset != dynamicInputBounds.size()) {
      throw jsi::JSError(rt, std::format("execute: get_dynamic_dims declares more dimensions ({}) "
                                         "than forward's tensor inputs use ({})",
                                         dynamicInputBounds.size(), boundsOffset));
    }

Suggested change: remove these lines.

    // Per-dimension [min, max, step] bounds parsed from get_dynamic_dims
    // at construction. Absent for statically shaped methods, which then
    // validate exactly.
    const std::vector<std::array<int64_t, 3>> noBounds;
    auto boundsIt = self->dynamicInputBounds_.find(methodName);
    const auto &dynamicInputBounds =
        boundsIt != self->dynamicInputBounds_.end() ? boundsIt->second : noBounds;
    size_t boundsOffset = 0;

Suggested change: remove these lines.

      * Writes data from a typed array into this tensor's native buffer.
      * @param src The source typed array. Its size in bytes must match the
    - * tensor's size.
    + * tensor's size. Use a `BigInt64Array` for `int64` tensors.

Suggested change:

    - * tensor's size. Use a `BigInt64Array` for `int64` tensors.
    + * tensor's size.


    std::unique_ptr<executorch::extension::Module> etModule_;
    std::mutex mutex_;

    std::unordered_map<std::string, std::vector<std::array<int64_t, 3>>> dynamicInputBounds_;

Suggested change:

    std::unordered_map<std::string, std::vector<tensor::SymbolicShape>> dynamicInputShapes_;

Let's use the tensor::SymbolicShape directly so that we don't have to build it on every execute call and can simplify the execute code.

     * @param input The input text to embed.
     * @returns A promise resolving to the embedding vector.
     */
    forward: (input: string) => Promise<Float32Array>;

Same as in image embeddings: a more descriptive name would be better imo.

    const tokenIds = tensor('int64', [1, len], idsData);
    const attentionMask = tensor('int64', [1, len], maskData);

Please use the t<Name> naming convention for tensor variables.

     * @returns A promise resolving to an object containing the embedding and
     * disposal controls.
     */
    export async function createTextEmbeddings(

This should probably be named ...Embedder to match other tasks. Same with the file name; perhaps ...Embedding.ts (no 's') would be more consistent.

    if (ids.length === 0) {
      throw new Error('createTextEmbeddings: input tokenized to zero tokens');
    }
    const len = Math.min(ids.length, maxSeqLen);

Worth documenting the truncating behaviour on long inputs.
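One way the truncation could be documented is sketched below on a standalone helper. `prepareTokenInputs` and `TokenInputs` are hypothetical names, and the docstring wording is only a proposal; the zero-token guard and the `Math.min` truncation mirror the hunk above. Whether trailing special tokens should survive truncation is not specified in this PR, so the sketch simply keeps the first `maxSeqLen` ids.

```typescript
interface TokenInputs {
  ids: BigInt64Array; // token ids, one int64 per token
  mask: BigInt64Array; // attention mask, all ones because nothing is padded
  truncated: boolean; // true when tokens beyond maxSeqLen were dropped
}

/**
 * Builds int64 model inputs from tokenizer output.
 *
 * Inputs longer than `maxSeqLen` tokens are truncated: only the first
 * `maxSeqLen` tokens are embedded and the rest are silently dropped.
 * Shorter inputs are passed at their exact length, without padding.
 *
 * @throws Error if the input tokenized to zero tokens.
 */
function prepareTokenInputs(ids: readonly number[], maxSeqLen: number): TokenInputs {
  if (ids.length === 0) {
    throw new Error('prepareTokenInputs: input tokenized to zero tokens');
  }
  const len = Math.min(ids.length, maxSeqLen);
  const idsData = new BigInt64Array(len);
  const maskData = new BigInt64Array(len);
  for (let i = 0; i < len; i++) {
    idsData[i] = BigInt(ids[i]); // int64 tensors are fed from BigInt64Array
    maskData[i] = BigInt(1);
  }
  return { ids: idsData, mask: maskData, truncated: ids.length > maxSeqLen };
}

// Example ids are arbitrary; a limit of 3 forces truncation of the 4th token.
const prepared = prepareTokenInputs([101, 7592, 2088, 102], 3);
```

Exposing a `truncated` flag is optional, but it lets callers warn users whose input was cut short.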

    export * from './tasks/tokenization';
    export * from './tasks/textEmbeddings';

Tasks shouldn't be explicitly exported from /extensions/<domain>/index.ts.
Description
Adds text and image embeddings pipelines to the new architecture, achieving parity with the old flow. Embeddings are pure-TypeScript tasks (pooling + L2-norm stay baked into the `.pte`): text tokenizes and runs `forward`; image reuses the existing image preprocessor. To run the existing int64-input embedding models unchanged, this adds an `int64`/`Long` tensor dtype to the core (the tensor data path is byte-oriented, so it is a small `dtype.{h,cpp}` + `tensor.ts` change).
Text inputs are fed at their exact token length (no padding). `model.execute` validates dynamically-shaped `forward` inputs against the `[min, max, step]` bounds exposed by an optional `get_dynamic_dims` method; models without it keep exact per-dimension validation. This fixes scale-sensitive pooling heads (e.g. DistilUSE's tanh projection), which padding otherwise corrupts.
Includes `createTextEmbeddings` / `createImageEmbeddings` tasks, `useTextEmbeddings` / `useImageEmbeddings` hooks, `models.textEmbeddings` / `models.imageEmbeddings` registry entries, an interactive text-embeddings demo in `apps/nlp`, and a CLIP zero-shot image-embeddings demo in `apps/computer-vision`.
Introduces a breaking change?
Type of change
Tested on
Testing instructions
- `nlp` app → Text Embeddings: seeds a sentence library; type a query and press Find similar to rank by cosine similarity; switch models via the chips. Verified on a physical Android device (arm64): all-MiniLM-L6-v2 returns 384-dim L2-normalized embeddings (~25 ms/forward on XNNPACK); DistilUSE ranks correctly with a wide similarity spread (previously compressed by padding).
- `computer-vision` app → Image Embeddings: pick an image and rank editable text labels via CLIP zero-shot (image vs. text embeddings). Verified on device.
Screenshots
Related issues
#1247
Checklist
Additional notes
DistilUSE and CLIP (text) are re-exported with the `get_dynamic_dims` method and pinned to `v0.10.0`; the remaining text-embedding models (all-MiniLM-L6-v2, all-mpnet-base-v2, multi-qa MiniLM/MPNet, paraphrase-ML) still need re-export to `v0.10.0`.