Offline Translation (Image Recognition and Translation)

GitHub · App Store (currently under review)

Description

Offline Translation is a fully on-device iOS app that translates text embedded in photos using Apple-native models as well as open-source models for OCR and multilingual machine translation. The app supports multiple pipelines—ranging from modular OCR + translation systems to end-to-end vision-language models—to benchmark translation quality across both high-resource and low-resource languages. All translation and inference happens on-device, without relying on cloud services or internet access.


Use Case

This app is built for travelers or users in low-connectivity environments who need fast and private translation of real-world text, such as road signs, menus, forms, or instructions. It functions offline, offering immediate utility in airports, remote regions, or while roaming internationally. The app also serves as a technical reference implementation for mobile AI deployment.


Market Comparison

Apple and Google both offer offline translation apps for iOS that rely on downloadable language kits. However, their OCR and translation components are not customizable, and language coverage is limited. Offline Translation investigates whether an open-source pipeline can surpass their translation quality, their language coverage, or both.


Key Research Questions


Commercial Feasibility

1/5 – This app is not intended as a commercial product. Apple and Google's first-party distribution advantages make competition impractical. Instead, Offline Translation is a research vehicle to explore the performance and deployment limits of modern multilingual models on mobile devices.


Technical Goals

Architectures Compared

The app is implemented as three interchangeable pipelines that share the same UI:

  1. Apple-native pipeline
    A baseline using Apple's own APIs for OCR (Vision), language detection (Natural Language), and translation (Translate). This pipeline serves as the control group for benchmarking; a minimal sketch of its OCR and language-detection stages follows this list.
  2. Hybrid pipeline
    Primarily uses Apple frameworks, but selectively falls back to third-party local models in failure cases. For example, if Apple Vision fails to detect text or the language is unsupported, the app invokes Google ML Kit for OCR and a local copy of Meta's M2M-100 (418M) for multilingual translation; a sketch of this fallback logic also follows the list.
  3. End-to-end vision-language model
    An experimental pipeline that uses a single compact vision-language model (e.g., MiniGPT-style) to go directly from an image to a translated string, bypassing modular components. Used to explore next-generation model architectures.
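As a rough illustration of the Apple-native baseline (pipeline 1), the sketch below runs Vision OCR and Natural Language language detection on a photo. The translation stage is omitted because Apple's Translation framework is session-based and tied to SwiftUI; the function and error type shown are illustrative, not the app's actual code.

```swift
import UIKit
import Vision
import NaturalLanguage

/// Illustrative error type for this sketch; the app's real error handling differs.
struct OCRError: Error {}

/// Runs Vision OCR on a photo and detects the dominant language of the
/// recognized text. The translation stage is intentionally left out.
func recognizeText(in image: UIImage) throws -> (text: String, language: NLLanguage?) {
    guard let cgImage = image.cgImage else { throw OCRError() }

    // Accurate (slower) recognition with language correction enabled.
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.usesLanguageCorrection = true

    // Perform the request synchronously on the image.
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])

    // Keep the top candidate string from each detected text region.
    let lines = (request.results ?? []).compactMap { $0.topCandidates(1).first?.string }
    let text = lines.joined(separator: "\n")

    // Detect the dominant language with the Natural Language framework.
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(text)
    return (text, recognizer.dominantLanguage)
}
```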
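The hybrid pipeline's fallback logic (pipeline 2) could be organized roughly as follows. `OCRService`, `Translator`, and the concrete wrappers named in the comments are hypothetical abstractions standing in for the app's real Vision, ML Kit, Translation-framework, and M2M-100 integrations, none of which is shown here.

```swift
import UIKit
import NaturalLanguage

// Hypothetical abstractions over the concrete OCR and translation backends.
protocol OCRService {
    func recognizeText(in image: UIImage) async throws -> String
}

protocol Translator {
    var supportedLanguages: Set<NLLanguage> { get }
    func translate(_ text: String, from source: NLLanguage, to target: NLLanguage) async throws -> String
}

enum HybridPipelineError: Error {
    case languageNotDetected
}

struct HybridPipeline {
    let appleOCR: OCRService            // wraps Vision
    let fallbackOCR: OCRService         // wraps Google ML Kit
    let appleTranslator: Translator     // wraps Apple's translation APIs
    let fallbackTranslator: Translator  // wraps local M2M-100 (418M)

    func translate(image: UIImage, to target: NLLanguage) async throws -> String {
        // 1. Try Apple Vision first; fall back to ML Kit if nothing is detected.
        var text = (try? await appleOCR.recognizeText(in: image)) ?? ""
        if text.isEmpty {
            text = try await fallbackOCR.recognizeText(in: image)
        }

        // 2. Detect the source language of the recognized text.
        let recognizer = NLLanguageRecognizer()
        recognizer.processString(text)
        guard let source = recognizer.dominantLanguage else {
            throw HybridPipelineError.languageNotDetected
        }

        // 3. Use Apple's translator when it covers the source language,
        //    otherwise route to the local M2M-100 model.
        let translator = appleTranslator.supportedLanguages.contains(source)
            ? appleTranslator
            : fallbackTranslator
        return try await translator.translate(text, from: source, to: target)
    }
}
```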

Choice of Models

Each pipeline uses models selected for their runtime performance, license compatibility, and support for low-resource languages:


Model Choice Tradeoffs

This project considered several open-source alternatives before selecting the final models for each component. Selection was based on technical feasibility, mobile optimization potential, and the ability to support low-resource languages:

Design Doc

Design Document for Offline Translation

Evaluation Methodology

To assess the performance of each pipeline, we will conduct evaluations focusing on both individual components and the overall system:

Component-Level Evaluation

System-Level Evaluation
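
As a sketch of how a system-level comparison might be driven on-device, the harness below runs each pipeline over a shared test set and records its output and end-to-end latency. `TranslationPipeline` is a hypothetical protocol, and quality scoring against reference translations would happen offline rather than in this code.

```swift
import UIKit
import NaturalLanguage

// Hypothetical protocol that all three pipelines would conform to.
protocol TranslationPipeline {
    var name: String { get }
    func translate(image: UIImage, to target: NLLanguage) async throws -> String
}

struct BenchmarkRecord {
    let pipeline: String
    let imageIndex: Int
    let output: String
    let latencySeconds: Double
}

/// Runs every pipeline over a shared image set, recording output and
/// end-to-end latency for later offline scoring.
func benchmark(pipelines: [TranslationPipeline],
               images: [UIImage],
               target: NLLanguage) async -> [BenchmarkRecord] {
    var records: [BenchmarkRecord] = []
    for pipeline in pipelines {
        for (index, image) in images.enumerated() {
            let start = Date()
            let output = (try? await pipeline.translate(image: image, to: target)) ?? ""
            records.append(BenchmarkRecord(pipeline: pipeline.name,
                                           imageIndex: index,
                                           output: output,
                                           latencySeconds: Date().timeIntervalSince(start)))
        }
    }
    return records
}
```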

Results & Learnings

(In progress)

Next Steps

UI Updates

Model Updates