imgutils.generic.clip

CLIP model interface for multimodal embeddings and predictions.

This module provides a comprehensive interface for working with CLIP models hosted on Hugging Face Hub.

The main class CLIPModel handles model management and provides:

  • Automatic discovery of available model variants

  • ONNX runtime integration for efficient inference

  • Preprocessing pipelines for images and text

  • Similarity calculation and prediction methods

Typical usage patterns:

  1. Direct API usage through clip_image_encode/clip_text_encode/clip_predict functions

  2. Instance-based control via CLIPModel class

  3. Web demo deployment through launch_demo method

Note

For optimal performance with multiple models, reuse CLIPModel instances when possible. The module implements LRU caching for model instances based on repository ID.
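
A minimal sketch of patterns 1 and 2, using a hypothetical repository ID (‘your-org/clip-onnx’) and variant name (‘ViT-B-16’); substitute the repository and variant you actually use:
>>> from imgutils.generic.clip import clip_predict, CLIPModel
>>>
>>> # Pattern 1: one-shot functional API (model instances are LRU-cached per repo_id)
>>> scores = clip_predict(
...     images=['photo.jpg'],
...     texts=['a cat', 'a dog'],
...     repo_id='your-org/clip-onnx',   # hypothetical repository ID
...     model_name='ViT-B-16',          # hypothetical variant name
... )
>>>
>>> # Pattern 2: explicit instance for repeated calls against one repository
>>> model = CLIPModel('your-org/clip-onnx')
>>> embeddings = model.image_encode(['photo.jpg'], model_name='ViT-B-16')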

CLIPModel

class imgutils.generic.clip.CLIPModel(repo_id: str, hf_token: str | None = None)[source]

Main interface for CLIP model operations.

This class provides thread-safe access to CLIP model variants stored in a Hugging Face repository. It handles model loading, preprocessing, inference, and provides web interface capabilities.

Parameters:
  • repo_id (str) – Hugging Face repository ID containing CLIP models

  • hf_token (Optional[str]) – Optional authentication token for private repositories

Note

Model components are loaded on-demand and cached for subsequent use. Use clear() method to free memory when working with multiple large models.

__init__(repo_id: str, hf_token: str | None = None)[source]

Initialize the CLIP model interface for the given Hugging Face repository.

clear()[source]

Clear all cached models and components.

Use this to free memory when switching between different model variants.
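
A short sketch of the intended lifecycle, with a hypothetical repository ID and variant name:
>>> from imgutils.generic.clip import CLIPModel
>>> model = CLIPModel('your-org/clip-onnx')                        # hypothetical repository ID
>>> emb = model.image_encode('photo.jpg', model_name='ViT-B-16')   # loads and caches this variant
>>> model.clear()                                                  # drop cached ONNX sessions and preprocessors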

image_encode(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...], model_name: str, fmt: Any = 'embeddings')[source]

Encode images into CLIP embeddings.

Parameters:
  • images (MultiImagesTyping) – Input images (file paths, raw bytes, binary streams, or PIL images)

  • model_name (str) – Target model variant name

  • fmt (Any) – Output format specification. Can be:
      - ‘embeddings’ (default): Return normalized embeddings
      - ‘encodings’: Return raw model outputs
      - Tuple of both: Return (embeddings, encodings)

Returns:

Encoded features in specified format

Return type:

Any

Note

Input images are automatically converted to RGB format with white background.
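
Illustrative calls, assuming a variant named ‘ViT-B-16’ exists in the repository; the ‘embeddings’ output is one L2-normalized row per input image:
>>> embeddings = model.image_encode(['a.jpg', 'b.jpg'], model_name='ViT-B-16')
>>> embeddings, encodings = model.image_encode(
...     'a.jpg',
...     model_name='ViT-B-16',
...     fmt=('embeddings', 'encodings'),   # request both outputs as a tuple
... )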

launch_demo(default_model_name: str | None = None, server_name: str | None = None, server_port: int | None = None, **kwargs)[source]

Launch a Gradio web interface for interactive CLIP model predictions.

Creates and launches a web demo that allows users to upload images, enter text labels, and get similarity predictions using the CLIP model. The interface includes model information and repository links.

Parameters:
  • default_model_name (Optional[str]) – Initial model variant to select in the dropdown

  • server_name (Optional[str]) – Host address to bind the server to (e.g., “0.0.0.0” for public access)

  • server_port (Optional[int]) – Port number to run the server on

  • kwargs – Additional keyword arguments passed to gradio.launch()

Returns:

None

Usage:
>>> model = CLIPModel("organization/model-name")
>>> model.launch_demo(server_name="0.0.0.0", server_port=7860)

make_ui(default_model_name: str | None = None)[source]

Create Gradio interface components for an interactive CLIP model demo.

This method sets up a user interface with image input, text input for labels, model selection dropdown, and prediction display. It automatically selects the most recently updated model variant if no default is specified.

Parameters:

default_model_name (Optional[str]) – Optional name of the model variant to select by default. If None, the most recently updated model variant will be selected.

Returns:

None
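
A sketch under the assumption that make_ui() is meant to be embedded inside an existing gradio Blocks layout (as launch_demo does); the repository ID is hypothetical:
>>> import gradio as gr
>>> from imgutils.generic.clip import CLIPModel
>>> model = CLIPModel('your-org/clip-onnx')   # hypothetical repository ID
>>> with gr.Blocks() as demo:
...     model.make_ui()                       # image input, label textbox, model dropdown, predictions
>>> demo.launch()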

predict(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...] | ndarray, texts: List[str] | str | ndarray, model_name: str, fmt='predictions')[source]

Calculate similarity predictions between images and texts.

Parameters:
  • images (Union[MultiImagesTyping, np.ndarray]) – Input images or precomputed embeddings

  • texts (Union[List[str], str, np.ndarray]) – Input texts or precomputed embeddings

  • model_name (str) – Target model variant name

  • fmt (Any) – Output format specification. Can be:
      - ‘predictions’ (default): Normalized probability scores
      - ‘similarities’: Cosine similarities
      - ‘logits’: Scaled similarity scores
      - Complex format using dict keys (‘image_embeddings’, ‘text_embeddings’, ‘similarities’, etc.)

Returns:

Prediction results in specified format

Return type:

Any

Note

When passing precomputed embeddings, ensure they are L2-normalized.
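
A sketch of the scalar output formats, with a hypothetical variant name:
>>> probs = model.predict(
...     images=['photo.jpg'],
...     texts=['a cat', 'a dog', 'a car'],
...     model_name='ViT-B-16',                 # hypothetical variant name
... )                                          # default fmt='predictions': probabilities over the texts
>>> sims = model.predict(['photo.jpg'], ['a cat', 'a dog'], 'ViT-B-16', fmt='similarities')
>>> logits = model.predict(['photo.jpg'], ['a cat', 'a dog'], 'ViT-B-16', fmt='logits')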

text_encode(texts: str | List[str], model_name: str, fmt: Any = 'embeddings')[source]

Encode text into CLIP embeddings.

Parameters:
  • texts (Union[str, List[str]]) – Input text or list of texts

  • model_name (str) – Target model variant name

  • fmt (Any) – Output format specification. Can be:
      - ‘embeddings’ (default): Return normalized embeddings
      - ‘encodings’: Return raw model outputs
      - Tuple of both: Return (embeddings, encodings)

Returns:

Encoded features in specified format

Return type:

Any
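
Illustrative calls, mirroring image_encode and assuming the same hypothetical variant name:
>>> text_embeddings = model.text_encode(['a cat', 'a dog'], model_name='ViT-B-16')
>>> text_embeddings, text_encodings = model.text_encode(
...     'a cat',
...     model_name='ViT-B-16',
...     fmt=('embeddings', 'encodings'),   # request both outputs as a tuple
... )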

clip_image_encode

imgutils.generic.clip.clip_image_encode(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...], repo_id: str, model_name: str, fmt: Any = 'embeddings', hf_token: str | None = None)[source]

Generate CLIP embeddings or features for the given images.

Parameters:
  • images (MultiImagesTyping) – Input images (file paths, raw bytes, binary streams, or PIL images)

  • repo_id (str) – Hugging Face model repository ID

  • model_name (str) – Name of the specific model variant to use

  • fmt (Any) – Output format (‘embeddings’ for normalized embeddings or ‘encodings’ for raw model outputs)

  • hf_token (Optional[str]) – Optional Hugging Face API token

Returns:

Image embeddings or encodings in the specified format
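
A minimal sketch with hypothetical repository and variant names:
>>> from imgutils.generic.clip import clip_image_encode
>>> emb = clip_image_encode(
...     images=['photo.jpg'],
...     repo_id='your-org/clip-onnx',   # hypothetical repository ID
...     model_name='ViT-B-16',          # hypothetical variant name
... )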

clip_text_encode

imgutils.generic.clip.clip_text_encode(texts: str | List[str], repo_id: str, model_name: str, fmt: Any = 'embeddings', hf_token: str | None = None)[source]

Generate CLIP embeddings or features for the given texts.

Parameters:
  • texts (Union[str, List[str]]) – Input text or list of texts

  • repo_id (str) – Hugging Face model repository ID

  • model_name (str) – Name of the specific model variant to use

  • fmt (Any) – Output format (‘embeddings’ for normalized embeddings or ‘encodings’ for raw model outputs)

  • hf_token (Optional[str]) – Optional Hugging Face API token

Returns:

Text embeddings or encodings in the specified format
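
A minimal sketch with hypothetical repository and variant names:
>>> from imgutils.generic.clip import clip_text_encode
>>> emb = clip_text_encode(
...     texts=['a cat', 'a dog'],
...     repo_id='your-org/clip-onnx',   # hypothetical repository ID
...     model_name='ViT-B-16',          # hypothetical variant name
... )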

clip_predict

imgutils.generic.clip.clip_predict(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...] | ndarray, texts: List[str] | str | ndarray, repo_id: str, model_name: str, fmt: Any = 'predictions', hf_token: str | None = None)[source]

Calculate similarity scores between images and texts using CLIP.

This function computes the similarity between the given images and texts using the specified CLIP model. It can accept raw images/texts or pre-computed embeddings as input.

Parameters:
  • images (Union[MultiImagesTyping, np.ndarray]) – Input images or pre-computed image embeddings

  • texts (Union[List[str], str, np.ndarray]) – Input texts or pre-computed text embeddings

  • repo_id (str) – Hugging Face model repository ID

  • model_name (str) – Name of the specific model variant to use

  • fmt (Any) – Output format (‘predictions’ for normalized probability scores, ‘similarities’ for cosine similarities, or ‘logits’ for scaled similarity scores)

  • hf_token (Optional[str]) – Optional Hugging Face API token

Returns:

Similarity scores or logits between images and texts
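
A minimal sketch with hypothetical repository and variant names; the default fmt returns one row of probabilities per image over the given texts:
>>> from imgutils.generic.clip import clip_predict
>>> scores = clip_predict(
...     images=['photo.jpg'],
...     texts=['a cat', 'a dog', 'a car'],
...     repo_id='your-org/clip-onnx',   # hypothetical repository ID
...     model_name='ViT-B-16',          # hypothetical variant name
... )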