imgutils.generic.clip
CLIP model interface for multimodal embeddings and predictions.
This module provides a comprehensive interface for working with CLIP models hosted on Hugging Face Hub.
The main class CLIPModel handles model management and provides:
- Automatic discovery of available model variants
- ONNX runtime integration for efficient inference
- Preprocessing pipelines for images and text
- Similarity calculation and prediction methods
Typical usage patterns:
- Direct API usage through the clip_image_encode / clip_text_encode / clip_predict functions
- Instance-based control via the CLIPModel class
- Web demo deployment through the launch_demo method
Note
For optimal performance with multiple models, reuse CLIPModel instances when possible. The module implements LRU caching for model instances based on repository ID.
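An illustrative sketch of the two usage patterns above; the repository ID and model variant names are placeholders, not real checkpoints:
>>> from imgutils.generic.clip import CLIPModel, clip_predict
>>> # Functional API: model instances are created and LRU-cached internally by repo_id
>>> scores = clip_predict(
...     images=['cat.jpg'],
...     texts=['a photo of a cat', 'a photo of a dog'],
...     repo_id='your-org/your-clip-onnx-repo',   # placeholder repository ID
...     model_name='your-model-variant',          # placeholder variant name
... )
>>> # Instance-based API: reuse the same CLIPModel object across many calls
>>> model = CLIPModel('your-org/your-clip-onnx-repo')
>>> embeddings = model.image_encode(['cat.jpg'], model_name='your-model-variant')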
CLIPModel
- class imgutils.generic.clip.CLIPModel(repo_id: str, hf_token: str | None = None)[source]
Main interface for CLIP model operations.
This class provides thread-safe access to CLIP model variants stored in a Hugging Face repository. It handles model loading, preprocessing, inference, and provides web interface capabilities.
- Parameters:
repo_id (str) – Hugging Face repository ID containing CLIP models
hf_token (Optional[str]) – Optional authentication token for private repositories
Note
Model components are loaded on-demand and cached for subsequent use. Use clear() method to free memory when working with multiple large models.
- clear()[source]
Clear all cached models and components.
Use this to free memory when switching between different model variants.
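A minimal sketch of when clear() is useful, assuming placeholder repository and variant names, when cycling through several large variants on a memory-constrained machine:
>>> model = CLIPModel('your-org/your-clip-onnx-repo')                       # placeholder repository ID
>>> emb_a = model.image_encode('image.jpg', model_name='variant-large-a')   # loads and caches variant A
>>> model.clear()                                                           # release all cached model components
>>> emb_b = model.image_encode('image.jpg', model_name='variant-large-b')   # variant B is loaded fresh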
- image_encode(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...], model_name: str, fmt: Any = 'embeddings')[source]
Encode images into CLIP embeddings.
- Parameters:
images (MultiImagesTyping) – Input images (file paths, raw bytes, binary file objects, or PIL images)
model_name (str) – Target model variant name
fmt (Any) – Output format specification. Can be 'embeddings' (default) to return normalized embeddings, 'encodings' to return raw model outputs, or a tuple of both to return (embeddings, encodings).
- Returns:
Encoded features in specified format
- Return type:
Any
Note
Input images are automatically converted to RGB format with white background.
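For example, the fmt parameter selects between normalized embeddings, raw encodings, or both (repository and variant names here are placeholders):
>>> model = CLIPModel('your-org/your-clip-onnx-repo')    # placeholder repository ID
>>> embs = model.image_encode(['1.jpg', '2.jpg'], model_name='variant-a')                   # normalized embeddings
>>> encs = model.image_encode(['1.jpg', '2.jpg'], model_name='variant-a', fmt='encodings')  # raw model outputs
>>> embs, encs = model.image_encode(['1.jpg', '2.jpg'], model_name='variant-a', fmt=('embeddings', 'encodings'))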
- launch_demo(default_model_name: str | None = None, server_name: str | None = None, server_port: int | None = None, **kwargs)[source]
Launch a Gradio web interface for interactive CLIP model predictions.
Creates and launches a web demo that allows users to upload images, enter text labels, and get similarity predictions using the CLIP model. The interface includes model information and repository links.
- Parameters:
default_model_name (Optional[str]) – Initial model variant to select in the dropdown
server_name (Optional[str]) – Host address to bind the server to (e.g., “0.0.0.0” for public access)
server_port (Optional[int]) – Port number to run the server on
kwargs – Additional keyword arguments passed to Gradio's launch() method
- Returns:
None
- Usage:
>>> model = CLIPModel("organization/model-name")
>>> model.launch_demo(server_name="0.0.0.0", server_port=7860)
- make_ui(default_model_name: str | None = None)[source]
Create Gradio interface components for an interactive CLIP model demo.
This method sets up a user interface with image input, text input for labels, model selection dropdown, and prediction display. It automatically selects the most recently updated model variant if no default is specified.
- Parameters:
default_model_name (Optional[str]) – Optional name of the model variant to select by default. If None, the most recently updated model variant will be selected.
- Returns:
None
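An illustrative sketch, assuming make_ui() is meant to be composed inside an existing gradio.Blocks context (the repository name is a placeholder); for a self-contained demo, launch_demo() above is the simpler entry point:
>>> import gradio as gr
>>> model = CLIPModel('your-org/your-clip-onnx-repo')   # placeholder repository ID
>>> with gr.Blocks() as demo:
...     model.make_ui()                                 # assumed usage: build the CLIP demo components into this Blocks layout
>>> demo.launch()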
- predict(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...] | ndarray, texts: List[str] | str | ndarray, model_name: str, fmt='predictions')[source]
Calculate similarity predictions between images and texts.
- Parameters:
images (Union[MultiImagesTyping, np.ndarray]) – Input images or precomputed embeddings
texts (Union[List[str], str, np.ndarray]) – Input texts or precomputed embeddings
model_name (str) – Target model variant name
fmt (Any) – Output format specification. Can be 'predictions' (default) for normalized probability scores, 'similarities' for cosine similarities, 'logits' for scaled similarity scores, or a composite format built from keys such as 'image_embeddings', 'text_embeddings', and 'similarities'.
- Returns:
Prediction results in specified format
- Return type:
Any
Note
When passing precomputed embeddings, ensure they are L2-normalized.
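A short sketch of the different output formats, including the precomputed-embedding path (repository and variant names are placeholders):
>>> model = CLIPModel('your-org/your-clip-onnx-repo')        # placeholder repository ID
>>> labels = ['a photo of a cat', 'a photo of a dog']
>>> probs = model.predict('cat.jpg', labels, model_name='variant-a')                       # normalized probabilities
>>> sims = model.predict('cat.jpg', labels, model_name='variant-a', fmt='similarities')    # cosine similarities
>>> # Precomputed embeddings can be reused; image_encode/text_encode return L2-normalized vectors by default
>>> img_embs = model.image_encode('cat.jpg', model_name='variant-a')
>>> probs2 = model.predict(img_embs, labels, model_name='variant-a')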
- text_encode(texts: str | List[str], model_name: str, fmt: Any = 'embeddings')[source]
Encode text into CLIP embeddings.
- Parameters:
texts (Union[str, List[str]]) – Input text or list of texts
model_name (str) – Target model variant name
fmt (Any) – Output format specification. Can be 'embeddings' (default) to return normalized embeddings, 'encodings' to return raw model outputs, or a tuple of both to return (embeddings, encodings).
- Returns:
Encoded features in specified format
- Return type:
Any
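As with image_encode, the fmt parameter controls the returned representation (repository and variant names are placeholders):
>>> model = CLIPModel('your-org/your-clip-onnx-repo')   # placeholder repository ID
>>> emb = model.text_encode('a photo of a cat', model_name='variant-a')      # single text
>>> embs = model.text_encode(['a cat', 'a dog'], model_name='variant-a')     # batch of texts
>>> embs, encs = model.text_encode(['a cat'], model_name='variant-a', fmt=('embeddings', 'encodings'))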
clip_image_encode
- imgutils.generic.clip.clip_image_encode(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...], repo_id: str, model_name: str, fmt: Any = 'embeddings', hf_token: str | None = None)[source]
Generate CLIP embeddings or features for the given images.
- Parameters:
images (MultiImagesTyping) – Input images (file paths, raw bytes, binary file objects, or PIL images)
repo_id (str) – Hugging Face model repository ID
model_name (str) – Name of the specific model variant to use
fmt (Any) – Output format ('embeddings' for normalized embeddings or 'encodings' for raw model outputs)
hf_token (Optional[str]) – Optional Hugging Face API token
- Returns:
Image embeddings or raw encodings, depending on fmt
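An illustrative call, with placeholder repository and variant names:
>>> from imgutils.generic.clip import clip_image_encode
>>> embs = clip_image_encode(
...     ['1.jpg', '2.jpg'],
...     repo_id='your-org/your-clip-onnx-repo',   # placeholder repository ID
...     model_name='variant-a',                   # placeholder variant name
... )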
clip_text_encode
- imgutils.generic.clip.clip_text_encode(texts: str | List[str], repo_id: str, model_name: str, fmt: Any = 'embeddings', hf_token: str | None = None)[source]
Generate CLIP embeddings or features for the given texts.
- Parameters:
texts (Union[str, List[str]]) – Input text or list of texts
repo_id (str) – Hugging Face model repository ID
model_name (str) – Name of the specific model variant to use
fmt (Any) – Output format ('embeddings' for normalized embeddings or 'encodings' for raw model outputs)
hf_token (Optional[str]) – Optional Hugging Face API token
- Returns:
Text embeddings or raw encodings, depending on fmt
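An illustrative call, with placeholder repository and variant names:
>>> from imgutils.generic.clip import clip_text_encode
>>> embs = clip_text_encode(
...     ['a photo of a cat', 'a photo of a dog'],
...     repo_id='your-org/your-clip-onnx-repo',   # placeholder repository ID
...     model_name='variant-a',                   # placeholder variant name
... )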
clip_predict
- imgutils.generic.clip.clip_predict(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...] | ndarray, texts: List[str] | str | ndarray, repo_id: str, model_name: str, fmt: Any = 'predictions', hf_token: str | None = None)[source]
Calculate similarity scores between images and texts using CLIP.
This function computes the similarity between the given images and texts using the specified CLIP model. It can accept raw images/texts or pre-computed embeddings as input.
- Parameters:
images (Union[MultiImagesTyping, np.ndarray]) – Input images or pre-computed image embeddings
texts (Union[List[str], str, np.ndarray]) – Input texts or pre-computed text embeddings
repo_id (str) – Hugging Face model repository ID
model_name (str) – Name of the specific model variant to use
fmt (Any) – Output format ('predictions' for normalized probability scores, 'similarities' for cosine similarities, or 'logits' for scaled similarity scores)
hf_token (Optional[str]) – Optional Hugging Face API token
- Returns:
Similarity scores or logits between images and texts
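An illustrative end-to-end sketch (repository and variant names are placeholders), combining precomputed embeddings from the encode functions with clip_predict:
>>> from imgutils.generic.clip import clip_image_encode, clip_predict
>>> repo = 'your-org/your-clip-onnx-repo'    # placeholder repository ID
>>> variant = 'variant-a'                    # placeholder variant name
>>> img_embs = clip_image_encode(['cat.jpg'], repo_id=repo, model_name=variant)   # L2-normalized embeddings
>>> scores = clip_predict(
...     images=img_embs,                     # precomputed embeddings are accepted in place of raw images
...     texts=['a photo of a cat', 'a photo of a dog'],
...     repo_id=repo,
...     model_name=variant,
... )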