Building a modern vision model from scratch takes massive datasets and weeks of GPU time. TorchVision offers a professional shortcut: state-of-the-art models pre-trained on ImageNet, COCO, and other benchmark datasets. The notebook content/ipynb/C2_M2_Lab_3_torchvision_3.ipynb demonstrates how to use these models for immediate inference on classification, segmentation, and object detection tasks. This post distills that workflow into a practical reference.

Why Pre-trained Models Matter

Training ResNet-50 or Faster R-CNN from scratch requires:

  • Millions of labeled images
  • Days of distributed GPU compute
  • Careful hyperparameter tuning

Pre-trained models give you that learned knowledge instantly. You download weights, run inference, and get production-grade results in minutes—not months.

The Two-Step Investigation: Know Your Model

Before running inference, answer two critical questions:

  1. How many classes can this model predict?
  2. What are those class names?

Modern TorchVision models embed this information in .meta attributes. For example, checking DeepLabV3_ResNet50_Weights.DEFAULT:

import torchvision.models as tv_models

# Load model and weights object
seg_model_weights = tv_models.segmentation.DeepLabV3_ResNet50_Weights.DEFAULT

# Access metadata
if hasattr(seg_model_weights, 'meta') and "categories" in seg_model_weights.meta:
    class_names = seg_model_weights.meta["categories"]
    print(f"Model recognizes {len(class_names)} classes")
    print(class_names)  # ['__background__', 'aeroplane', 'bicycle', 'bird', ...]

This confirms that 'dog' is in the class list before you waste time on inference.
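
A quick guard along those lines, reusing class_names from the snippet above, stops a job before any compute is spent:

target = 'dog'
if target not in class_names:
    raise ValueError(f"'{target}' is not among this model's {len(class_names)} classes")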

Legacy Models: Manual Detective Work

Older models loaded via the legacy pretrained=True flag (deprecated since TorchVision 0.13) carry no weights object and therefore no .meta attribute. You inspect the architecture directly:

resnet50_model = tv_models.resnet50(pretrained=True)
print(resnet50_model)  # Look for the final layer

# Output includes: (fc): Linear(in_features=2048, out_features=1000, bias=True)
num_classes = resnet50_model.fc.out_features  # 1000

Then find the class names in external files like imagenet_class_index.json. This manual step is only necessary for legacy loading methods—modern weights objects handle it automatically.
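
The classification workflow later in this post calls a load_imagenet_classes helper. That function is not part of TorchVision, so here is a minimal sketch of what it might look like, assuming the widely distributed imagenet_class_index.json format:

import json

def load_imagenet_classes(path):
    # The file maps string indices to [wordnet_id, class_name] pairs,
    # e.g. "207" -> ["n02099601", "golden_retriever"]
    with open(path) as f:
        return json.load(f)

imagenet_classes = load_imagenet_classes('./imagenet_class_index.json')
print(imagenet_classes["207"][1])  # golden_retriever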

Visualizing Predictions: Draw What You Detect

Raw tensor outputs are abstract until you draw them. TorchVision provides two essential utilities:

1. Bounding Boxes

Object detection models return raw coordinates. draw_bounding_boxes turns them into labeled rectangles drawn directly on the image:

import torch
from torchvision.utils import draw_bounding_boxes
from torchvision.io import decode_image

image = decode_image('./dog1.jpg')
boxes = torch.tensor([[140, 30, 375, 315], [200, 70, 230, 110]], dtype=torch.float)
labels = ["dog", "eye"]

result = draw_bounding_boxes(
    image=image,
    boxes=boxes,  # Shape: (N, 4) for (xmin, ymin, xmax, ymax)
    labels=labels,
    colors=["red", "blue"],
    width=3
)

The same overlay technique appears in self-driving car dashboards and automated checkout systems.
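
Note that draw_bounding_boxes returns a uint8 tensor rather than displaying anything; one way to view or save the result:

from torchvision.transforms.functional import to_pil_image

to_pil_image(result).save('./dog1_boxes.jpg')  # or .show() for a quick look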

2. Segmentation Masks

For pixel-perfect object boundaries, draw_segmentation_masks overlays boolean masks:

from torchvision.utils import draw_segmentation_masks

# image: uint8 tensor of shape (C, H, W); object_mask: boolean tensor of shape (1, H, W)
result = draw_segmentation_masks(
    image=image,
    masks=object_mask,
    alpha=0.5,  # Transparency
    colors=["blue"]
)

Medical imaging and autonomous driving rely on this precision to outline tumors or road boundaries.

Inference Workflow 1: Image Segmentation

Using DeepLabV3_ResNet50 to find and mask a dog:

from PIL import Image
from torchvision import transforms

# 1. Load model
seg_model = tv_models.segmentation.deeplabv3_resnet50(
    weights=tv_models.segmentation.DeepLabV3_ResNet50_Weights.DEFAULT
).eval()

# 2. Prepare tensors
img = Image.open('./dog2.jpg').convert('RGB')
original_image_tensor = transforms.ToTensor()(img)

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
input_tensor = normalize(original_image_tensor).unsqueeze(0)

# 3. Inference
with torch.no_grad():
    output = seg_model(input_tensor)['out'][0]

# 4. Generate mask
output_predictions = output.argmax(0)  # Best class per pixel
dog_class_idx = class_names.index('dog')
dog_mask = (output_predictions == dog_class_idx).unsqueeze(0)

# 5. Visualize
result = draw_segmentation_masks(
    image=(original_image_tensor * 255).byte(),
    masks=dog_mask,
    alpha=0.5,
    colors=["blue"]
)

The model outputs a tensor of shape (num_classes, H, W) with scores per pixel. You take the argmax to get the winning class, then filter for your target.
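
A quick sanity check on that argmax: counting pixels per predicted class (a small sketch reusing class_names from the earlier meta lookup) shows what the model thinks dominates the scene before you trust the mask:

import torch

# Count how many pixels were assigned to each class index
counts = torch.bincount(output_predictions.flatten(), minlength=len(class_names))
for name, count in zip(class_names, counts.tolist()):
    if count > 0:
        print(f"{name}: {count} pixels")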

Inference Workflow 2: Image Classification

Using legacy ResNet50 to predict the main subject:

# 1. Load model and class names
resnet50_model = tv_models.resnet50(pretrained=True).eval()
imagenet_classes = load_imagenet_classes('./imagenet_class_index.json')

# 2. Transform image to model's expected format
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
input_batch = transform(img).unsqueeze(0)

# 3. Inference
with torch.no_grad():
    output = resnet50_model(input_batch)

# 4. Convert to probabilities and get top predictions
probabilities = torch.nn.functional.softmax(output[0], dim=0)
top_prob, top_catid = torch.topk(probabilities, 5)

# 5. Display results
for i in range(5):
    class_id_str = str(top_catid[i].item())
    class_name = imagenet_classes[class_id_str][1]
    confidence = top_prob[i].item() * 100
    print(f"Top-{i+1}: {class_name} ({confidence:.2f}%)")

The model correctly identifies 'golden_retriever' with high confidence.

Object Detection: Faster R-CNN in Action

For detecting multiple objects with bounding boxes:

# Load Faster R-CNN model
bb_model_weights = tv_models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
bb_model = tv_models.detection.fasterrcnn_resnet50_fpn(weights=bb_model_weights).eval()

# Define targets
target_class_names = ['car', 'traffic light']
bbox_colors = ['red', 'blue']

# Inference (pil_image is a PIL image loaded earlier, e.g. a street scene)
tensor_image_batch = transforms.ToTensor()(pil_image).unsqueeze(0)
with torch.no_grad():
    prediction = bb_model(tensor_image_batch)[0]

# Map target names to COCO label indices via the weights metadata
coco_categories = bb_model_weights.meta["categories"]
object_indices = [coco_categories.index(name) for name in target_class_names]

# Filter by class and confidence threshold
for index, label, color in zip(object_indices, target_class_names, bbox_colors):
    class_mask = (prediction['labels'] == index) & (prediction['scores'] > 0.7)
    boxes = prediction['boxes'][class_mask]
    # Draw boxes...

The model returns dictionaries with 'boxes', 'labels', and 'scores' keys—filter and visualize as needed.
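
Completing the loop above, a minimal sketch of the drawing step, assuming pil_image and the filtered predictions from the previous snippet:

from torchvision.utils import draw_bounding_boxes
from torchvision.transforms.functional import pil_to_tensor, to_pil_image

image_uint8 = pil_to_tensor(pil_image)  # uint8 (C, H, W) canvas for drawing

for index, label, color in zip(object_indices, target_class_names, bbox_colors):
    class_mask = (prediction['labels'] == index) & (prediction['scores'] > 0.7)
    boxes = prediction['boxes'][class_mask]
    if len(boxes) > 0:
        image_uint8 = draw_bounding_boxes(
            image=image_uint8,
            boxes=boxes,
            labels=[label] * len(boxes),
            colors=[color] * len(boxes),
            width=3,
        )

to_pil_image(image_uint8).show()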

Key Architecture Catalog

TorchVision organizes models by task:

  • Classification: ResNet, VGG, MobileNetV3, DenseNet
  • Segmentation: FCN, DeepLabV3
  • Object Detection: Faster R-CNN, RetinaNet, SSD
  • Video: R(2+1)D, MC3, Video MViT

Always use .DEFAULT weights, which point to the best available checkpoint for each model, and check the .meta attribute first.
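
If you want to browse the catalog programmatically, TorchVision 0.14+ exposes a registration API; a minimal sketch:

import torchvision.models as tv_models

# List registered builders for a task module
print(tv_models.list_models(module=tv_models.segmentation))  # ['deeplabv3_mobilenet_v3_large', ...]

# Resolve the default weights for any model by name
weights = tv_models.get_model_weights('resnet50').DEFAULT
print(len(weights.meta['categories']))  # 1000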

Critical Habits for Production Inference

  1. Match preprocessing to training data: Models trained on ImageNet expect 224×224 inputs with specific normalization, and mismatches degrade performance silently (see the sketch after this list).

  2. Check class lists before inference: Avoid wasting compute on models that don’t recognize your target objects.

  3. Use .eval() mode: Disables dropout and switches batch normalization layers to their stored running statistics during inference.

  4. Wrap inference in torch.no_grad(): Prevents gradient computation, saving memory and speeding up predictions.

  5. Validate with visualization: Always draw bounding boxes or masks to verify model behavior before trusting predictions.
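
For habit #1, modern weights objects ship their exact training-time preprocessing, so you never have to hand-copy mean/std constants; a minimal sketch:

import torch
from PIL import Image
import torchvision.models as tv_models

weights = tv_models.ResNet50_Weights.DEFAULT
preprocess = weights.transforms()  # resize, center-crop, and normalization used in training

model = tv_models.resnet50(weights=weights).eval()
batch = preprocess(Image.open('./dog2.jpg').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)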

Where to Go Next

  • Transfer Learning: Fine-tune these pre-trained models on custom datasets (coming in the next lab).
  • Custom Datasets: Wrap your data in DataLoader and adapt model heads for specialized classes.
  • Model Export: Convert to ONNX or TorchScript for deployment to edge devices.

Pre-trained models aren’t a shortcut—they’re the standard, efficient path to production vision systems. Master inference first, then customize through fine-tuning.