- By BPR
Privacy-preserving Image Search using Natural Language on iOS
Best Path Research has developed a privacy-focused prototype iOS app that allows a user to locally search all the photos stored on their Apple device without sending any data to a server. Both indexing and search take place entirely on the user’s mobile device and can even be conducted in “airplane mode”, giving the user complete confidence in the security of their data.
Please take a look at the following screen recording of our demo app, showing the real-time speed of indexing over 45,000 images on an iPhone 12, as well as two examples of a natural language search query in English returning the most relevant photos.
The technology used in this app is based on the CLIP (Contrastive Language-Image Pre-training) vision and text encoders released by OpenAI in 2020. We extended these models by replacing the original English-only CLIP text encoder with a multilingual text encoder that handles 40 different languages, thus extending the app’s search capability to those same 40 languages.
First, we create an index of all the photos on the device by generating an image “embedding” vector for each photo using the CLIP image model. These vectors are indexed using NGT, a highly performant, open-source nearest-neighbour vector database library, which we modified specifically to run on iOS. Once the index has been built, the image collection can be searched using natural language text queries. At query time, each text query is similarly converted to an embedding vector using the CLIP text model (in fact, the modified multilingual version of the original), and the distances returned by NGT are used to create a ranked list of the best-matching images, which are then displayed to the user.
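The embed-index-search loop above can be sketched in a few lines. This is a minimal illustration only: toy vectors stand in for real CLIP embeddings, and a brute-force cosine-distance scan in NumPy stands in for the NGT index (all function names here are hypothetical, not the app’s actual API).

```python
import numpy as np

def normalize(v):
    # CLIP-style embeddings are typically L2-normalized before indexing
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def build_index(image_embeddings):
    # Stand-in for the NGT index: just store the normalized matrix
    return normalize(np.asarray(image_embeddings, dtype=np.float32))

def search(index, query_embedding, k=3):
    # On unit vectors, cosine distance = 1 - dot product;
    # NGT returns the same nearest neighbours without a full scan
    q = normalize(np.asarray(query_embedding, dtype=np.float32))
    distances = 1.0 - index @ q
    order = np.argsort(distances)[:k]
    return [(int(i), float(distances[i])) for i in order]

# Toy 4-dimensional "embeddings" standing in for CLIP outputs
images = [[1, 0, 0, 0], [0, 1, 0, 0], [0.9, 0.1, 0, 0]]
index = build_index(images)
print(search(index, [1, 0, 0, 0], k=2))  # image 0 first, then image 2
```

In the real app the query vector would come from the multilingual CLIP text encoder, and the ranked ids would map back to photos in the camera roll.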
One interesting aspect of this vector-based indexing and search approach is that it also allows the user to specify adjectives (such as “red” or “large”) and verbs (such as “running” or “drinking”), which are not typically helpful in a traditional image search engine. This functionality lets the user narrow down a photo about which they might remember only vague details. In technical terms, we say that the app prioritizes recall over precision, with the expectation that returning a small collection of highly relevant results will allow the user to quickly and easily select the photo they were looking for, even if it doesn’t appear in the top-ranked position.
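The recall-over-precision trade-off can be illustrated with toy vectors (all values hypothetical): the photo the user wants need not be the single best match, only somewhere in the short returned list.

```python
import numpy as np

def top_k(embeddings, query, k):
    # Rank all images by cosine distance and return the k best ids
    mat = np.asarray(embeddings, dtype=np.float32)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    return [int(i) for i in np.argsort(1.0 - mat @ q)[:k]]

# Toy embeddings: image 2 is the photo the user wants,
# but image 0 happens to match the query text slightly better
images = [[0.95, 0.05], [0.1, 0.9], [0.9, 0.1]]
query = [1.0, 0.0]
wanted = 2

ranked = top_k(images, query, k=2)
print(wanted in ranked)     # True: found within the top 2 (recall)
print(ranked[0] == wanted)  # False: not the single top result
```

Returning a small ranked set rather than a single “best” answer is what makes vague queries usable in practice.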
In the future, we plan to extend the app to also perform image-to-image search, which would allow a user, for example, to take a photo, or select an existing image, and find the most similar images in their camera roll. To achieve such impressive real-time performance on iOS, Best Path Research applied its knowledge of PyTorch model tracing and size reduction to convert the CLIP encoder models to Apple’s CoreML format. Combined with NGT’s high-speed nearest-neighbour vector matching, this allowed us to search huge image collections in real time on a mid-range mobile device.
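Image-to-image search fits the same pipeline: instead of embedding a text query, the query photo is passed through the image encoder and its vector is matched against the existing index. A minimal sketch, again with toy vectors in place of real CLIP embeddings and a brute-force scan in place of NGT (the helper name and `exclude` parameter are illustrative assumptions):

```python
import numpy as np

def nearest_images(embeddings, query_vec, k=3, exclude=None):
    # Same nearest-neighbour lookup as text search, but the query
    # vector comes from the image encoder instead of the text encoder
    mat = np.asarray(embeddings, dtype=np.float32)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    order = np.argsort(1.0 - mat @ q)
    hits = [int(i) for i in order if i != exclude]
    return hits[:k]

# Toy "image embeddings"; image 3 is a near-duplicate of image 0
library = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0.98, 0.02, 0]]
# Use image 0 itself as the query, excluding it from the results
print(nearest_images(library, library[0], k=1, exclude=0))  # [3]
```

Because text and image embeddings share one vector space in CLIP, no change to the index is needed; only the source of the query vector differs.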
Unfortunately, due to licensing issues related to OpenAI’s models, we are unable to release this demo app publicly. However, if the technology we have used is of interest, please feel free to contact us for a demo, or a time-limited app trial, or just to discuss how Best Path Research might be able to help you with your model development, conversion or implementation needs.
Keywords: Natural Language Image Search, CLIP, CoreML, PyTorch, Transformer Encoder Models, Nearest-Neighbour Search, Vector Database, NGT, iOS app