imgutils.generic.clip

CLIP model interface for multimodal embeddings and predictions.

This module provides a comprehensive interface for working with CLIP models hosted on Hugging Face Hub.

The main class CLIPModel handles model management and provides:

  • Automatic discovery of available model variants

  • ONNX runtime integration for efficient inference

  • Preprocessing pipelines for images and text

  • Similarity calculation and prediction methods

Typical usage patterns:

  1. Direct API usage through clip_image_encode/clip_text_encode/clip_predict functions

  2. Instance-based control via CLIPModel class

  3. Web demo deployment through launch_demo method

Note

For optimal performance with multiple models, reuse CLIPModel instances when possible. The module implements LRU caching for model instances based on repository ID.
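
A minimal sketch of patterns 1 and 2, using a hypothetical repository ID (‘your-org/clip-onnx’) and variant name (‘ViT-B-16’); substitute the repository and variant you actually use:
>>> from imgutils.generic.clip import clip_predict, CLIPModel
>>>
>>> # Pattern 1: one-shot functional API (model instances are LRU-cached per repo_id)
>>> scores = clip_predict(
...     images=['photo.jpg'],
...     texts=['a cat', 'a dog'],
...     repo_id='your-org/clip-onnx',   # hypothetical repository ID
...     model_name='ViT-B-16',          # hypothetical variant name
... )
>>>
>>> # Pattern 2: explicit instance for repeated calls against one repository
>>> model = CLIPModel('your-org/clip-onnx')
>>> embeddings = model.image_encode(['photo.jpg'], model_name='ViT-B-16')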

CLIPModel

class imgutils.generic.clip.CLIPModel(repo_id: str, hf_token: str | None = None)[source]

Main interface for CLIP model operations.

This class provides thread-safe access to CLIP model variants stored in a Hugging Face repository. It handles model loading, preprocessing, inference, and provides web interface capabilities.

Parameters:
  • repo_id (str) – Hugging Face repository ID containing CLIP models

  • hf_token (Optional[str]) – Optional authentication token for private repositories

Note

Model components are loaded on-demand and cached for subsequent use. Use clear() method to free memory when working with multiple large models.

__init__(repo_id: str, hf_token: str | None = None)[source]

Initialize the CLIP model interface for the given Hugging Face repository.

clear()[source]

Clear all cached models and components.

Use this to free memory when switching between different model variants.
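
A short sketch of the intended lifecycle, with a hypothetical repository ID and variant name:
>>> from imgutils.generic.clip import CLIPModel
>>> model = CLIPModel('your-org/clip-onnx')                        # hypothetical repository ID
>>> emb = model.image_encode('photo.jpg', model_name='ViT-B-16')   # loads and caches this variant
>>> model.clear()                                                  # drop cached ONNX sessions and preprocessors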

image_encode(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...], model_name: str, fmt: Any = 'embeddings')[source]

Encode images into CLIP embeddings.

Parameters:
  • images (MultiImagesTyping) – Input images (file paths, raw bytes, binary streams, or PIL images)

  • model_name (str) – Target model variant name

  • fmt (Any) – Output format specification. Can be:
      - ‘embeddings’ (default): Return normalized embeddings
      - ‘encodings’: Return raw model outputs
      - Tuple of both: Return (embeddings, encodings)

Returns:

Encoded features in specified format

Return type:

Any

Note

Input images are automatically converted to RGB format with white background.
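
Illustrative calls, assuming a variant named ‘ViT-B-16’ exists in the repository; the ‘embeddings’ output is one L2-normalized row per input image:
>>> embeddings = model.image_encode(['a.jpg', 'b.jpg'], model_name='ViT-B-16')
>>> embeddings, encodings = model.image_encode(
...     'a.jpg',
...     model_name='ViT-B-16',
...     fmt=('embeddings', 'encodings'),   # request both outputs as a tuple
... )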

launch_demo(default_model_name: str | None = None, server_name: str | None = None, server_port: int | None = None, **kwargs)[source]

Launch a Gradio web interface for interactive CLIP model predictions.

Creates and launches a web demo that allows users to upload images, enter text labels, and get similarity predictions using the CLIP model. The interface includes model information and repository links.

Parameters:
  • default_model_name (Optional[str]) – Initial model variant to select in the dropdown

  • server_name (Optional[str]) – Host address to bind the server to (e.g., “0.0.0.0” for public access)

  • server_port (Optional[int]) – Port number to run the server on

  • kwargs – Additional keyword arguments passed to gradio.launch()

Returns:

None

Usage:
>>> model = CLIPModel("organization/model-name")
>>> model.launch_demo(server_name="0.0.0.0", server_port=7860)

make_ui(default_model_name: str | None = None)[source]

Create Gradio interface components for an interactive CLIP model demo.

This method sets up a user interface with image input, text input for labels, model selection dropdown, and prediction display. It automatically selects the most recently updated model variant if no default is specified.

Parameters:

default_model_name (Optional[str]) – Optional name of the model variant to select by default. If None, the most recently updated model variant will be selected.

Returns:

None
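
A sketch under the assumption that make_ui() is meant to be embedded inside an existing gradio Blocks layout (as launch_demo does); the repository ID is hypothetical:
>>> import gradio as gr
>>> from imgutils.generic.clip import CLIPModel
>>> model = CLIPModel('your-org/clip-onnx')   # hypothetical repository ID
>>> with gr.Blocks() as demo:
...     model.make_ui()                       # image input, label textbox, model dropdown, predictions
>>> demo.launch()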

predict(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...] | ndarray, texts: List[str] | str | ndarray, model_name: str, fmt='predictions')[source]

Calculate similarity predictions between images and texts.

Parameters:
  • images (Union[MultiImagesTyping, np.ndarray]) – Input images or precomputed embeddings

  • texts (Union[List[str], str, np.ndarray]) – Input texts or precomputed embeddings

  • model_name (str) – Target model variant name

  • fmt (Any) – Output format specification. Can be:
      - ‘predictions’ (default): Normalized probability scores
      - ‘similarities’: Cosine similarities
      - ‘logits’: Scaled similarity scores
      - Complex format using dict keys (‘image_embeddings’, ‘text_embeddings’, ‘similarities’, etc.)

Returns:

Prediction results in specified format

Return type:

Any

Note

When passing precomputed embeddings, ensure they are L2-normalized.
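
A sketch of the scalar output formats, with a hypothetical variant name:
>>> probs = model.predict(
...     images=['photo.jpg'],
...     texts=['a cat', 'a dog', 'a car'],
...     model_name='ViT-B-16',                 # hypothetical variant name
... )                                          # default fmt='predictions': probabilities over the texts
>>> sims = model.predict(['photo.jpg'], ['a cat', 'a dog'], 'ViT-B-16', fmt='similarities')
>>> logits = model.predict(['photo.jpg'], ['a cat', 'a dog'], 'ViT-B-16', fmt='logits')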

text_encode(texts: str | List[str], model_name: str, fmt: Any = 'embeddings')[source]

Encode text into CLIP embeddings.

Parameters:
  • texts (Union[str, List[str]]) – Input text or list of texts

  • model_name (str) – Target model variant name

  • fmt (Any) – Output format specification. Can be:
      - ‘embeddings’ (default): Return normalized embeddings
      - ‘encodings’: Return raw model outputs
      - Tuple of both: Return (embeddings, encodings)

Returns:

Encoded features in specified format

Return type:

Any
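
Illustrative calls, mirroring image_encode and assuming the same hypothetical variant name:
>>> text_embeddings = model.text_encode(['a cat', 'a dog'], model_name='ViT-B-16')
>>> text_embeddings, text_encodings = model.text_encode(
...     'a cat',
...     model_name='ViT-B-16',
...     fmt=('embeddings', 'encodings'),   # request both outputs as a tuple
... )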

clip_image_encode

imgutils.generic.clip.clip_image_encode(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...], repo_id: str, model_name: str, fmt: Any = 'embeddings', hf_token: str | None = None)[source]

Generate CLIP embeddings or features for the given images.

Parameters:
  • images (MultiImagesTyping) – Input images (file paths, raw bytes, binary streams, or PIL images)

  • repo_id (str) – Hugging Face model repository ID

  • model_name (str) – Name of the specific model variant to use

  • fmt (Any) – Output format (‘embeddings’ for normalized embeddings or ‘encodings’ for raw model outputs)

  • hf_token (Optional[str]) – Optional Hugging Face API token

Returns:

Image embeddings or encodings in the specified format
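
A minimal sketch with hypothetical repository and variant names:
>>> from imgutils.generic.clip import clip_image_encode
>>> emb = clip_image_encode(
...     images=['photo.jpg'],
...     repo_id='your-org/clip-onnx',   # hypothetical repository ID
...     model_name='ViT-B-16',          # hypothetical variant name
... )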

clip_text_encode

imgutils.generic.clip.clip_text_encode(texts: str | List[str], repo_id: str, model_name: str, fmt: Any = 'embeddings', hf_token: str | None = None)[source]

Generate CLIP embeddings or features for the given texts.

Parameters:
  • texts (Union[str, List[str]]) – Input text or list of texts

  • repo_id (str) – Hugging Face model repository ID

  • model_name (str) – Name of the specific model variant to use

  • fmt (Any) – Output format (‘embeddings’ for normalized embeddings or ‘encodings’ for raw model outputs)

  • hf_token (Optional[str]) – Optional Hugging Face API token

Returns:

Text embeddings or encodings in the specified format
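
A minimal sketch with hypothetical repository and variant names:
>>> from imgutils.generic.clip import clip_text_encode
>>> emb = clip_text_encode(
...     texts=['a cat', 'a dog'],
...     repo_id='your-org/clip-onnx',   # hypothetical repository ID
...     model_name='ViT-B-16',          # hypothetical variant name
... )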

clip_predict

imgutils.generic.clip.clip_predict(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...] | ndarray, texts: List[str] | str | ndarray, repo_id: str, model_name: str, fmt: Any = 'predictions', hf_token: str | None = None)[source]

Calculate similarity scores between images and texts using CLIP.

This function computes the similarity between the given images and texts using the specified CLIP model. It can accept raw images/texts or pre-computed embeddings as input.

Parameters:
  • images (Union[MultiImagesTyping, np.ndarray]) – Input images or pre-computed image embeddings

  • texts (Union[List[str], str, np.ndarray]) – Input texts or pre-computed text embeddings

  • repo_id (str) – Hugging Face model repository ID

  • model_name (str) – Name of the specific model variant to use

  • fmt (Any) – Output format (‘predictions’ for normalized probability scores, ‘similarities’ for cosine similarities, or ‘logits’ for scaled similarity scores)

  • hf_token (Optional[str]) – Optional Hugging Face API token

Returns:

Similarity scores or logits between images and texts
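
A minimal sketch with hypothetical repository and variant names; the default fmt returns one row of probabilities per image over the given texts:
>>> from imgutils.generic.clip import clip_predict
>>> scores = clip_predict(
...     images=['photo.jpg'],
...     texts=['a cat', 'a dog', 'a car'],
...     repo_id='your-org/clip-onnx',   # hypothetical repository ID
...     model_name='ViT-B-16',          # hypothetical variant name
... )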