Archive Image Matching

Visual Matching in Historical Print Catalogues

The goal is to build a tool that can take an input image (e.g., Fig. 1) and, with optional filters such as location or date range to narrow the search, automatically scans the e-rara archive for visually similar pages. The tool should return direct links to matching results, enabling researchers and users to quickly identify recurring motifs, printer’s devices, illustrations, or other visual elements across the archive.

Presentation link

Inputs

The dataset for this challenge is provided by e-rara.ch, which hosts digitized versions of historical books and offers an API for image access. The full archive contains over 154'000 titles and millions of scanned pages. However, for practical purposes, researchers often limit their scope to fewer than 100 titles, amounting to a few thousand pages - making local processing feasible. Although processing on the Ubelix cluster is also a possibility.

Goals

Art historians and scholars in related fields would greatly benefit from the ability to search for visually similar images within large catalogues of historical prints. A particularly valuable use case is identifying recurring visual elements - such as printer's imprints - across different books and editions.

For optimal relevance, the matching should account for different visual variations, such as:

Different sizes
Mirroring or rotation
Ink smudges or degradation
Colorization

Constraints & Considerations

Approaches using image classifiers, local feature descriptors, or other vision methods are welcome.
A fast matching algorithm is required given the large amount of fetched images.
Solutions that do not require a GPU and can run locally are especially encouraged.
Creativity in lightweight or approximate matching is valued.

Team

Our team will ideally include:

Computer Vision engineer: interested in image processing, feature extraction, and pattern detection.
Backend engineer: someone with expertise in working with APIs and cloud data extraction.
Usability engineer: a designer interested in creating a web-based UI for our a tool.

Hackathon Solution

Our team developed an innovative method to address the limitations of traditional feature extraction techniques in the context of scanned documents.

A wide variety of feature extraction algorithms have been proposed for processing images. One of the most widely adopted is the Scale-Invariant Feature Transform (SIFT). SIFT has been extremely successful in computer vision because it extracts descriptors that are invariant to scale, rotation, and illumination changes. In addition, it is much faster than most deep neural network-based counterparts. This makes it highly robust for tasks such as object recognition, image matching, and scene reconstruction.

However, applying SIFT directly to scanned documents introduces significant challenges. Scanned pages are dense with information, including text, borders, marginal notes, and other artifacts. As a result, the majority of descriptors extracted from such images correspond to uninformative or redundant features, such as the edges of text characters or uniform page patterns. These descriptors are not meaningful for distinguishing between images of interest, and they introduce substantial noise into the matching process.

To overcome this limitation, our team designed a novel solution inspired by techniques from information retrieval. We applied term frequency–inverse document frequency (TF-IDF) weighting to the extracted descriptors. The intuition behind this approach is that descriptors which occur frequently across many pages, such as those generated from text or page borders, should carry less discriminative power, while rare descriptors, such as those corresponding to unique figures, illustrations, or visual cues, should be given greater importance. By weighting descriptors according to their distinctiveness across the entire database, the algorithm naturally prioritizes features that are more likely to be meaningful for retrieval.

Once descriptors are weighted, we organize them into a hierarchical verbal tree structure. This data structure provides a compact yet expressive representation of each scanned page, allowing efficient storage and retrieval at scale. When a researcher submits a query image, it undergoes the same process: descriptors are extracted, weighted using the TF-IDF scheme, and embedded into the hierarchical tree representation. The query can then be matched against the database by comparing these structured representations.

This approach yields several advantages:

Noise reduction: Irrelevant descriptors from text and borders are down-weighted.
Discriminative focus: Unique image features, such as illustrations or diagrams, gain higher priority in matching.
Scalability: The hierarchical structure allows efficient indexing and retrieval, even in very large collections of scanned pages.
Robustness: The method maintains the core strengths of SIFT (scale, rotation, and illumination invariance) while tailoring the representation to the challenges of scanned documents.

By combining established computer vision techniques with concepts from information retrieval, our team created a system that significantly improves the accuracy and efficiency of image retrieval in large collections of scanned documents.

Contacts

For any question you can contact matteo.boi@unibe.ch

This challenge originates from Torben Hanhart at the Institute of Art History, University of Bern.

Fig. 1: Printer’s imprint used in Bern, ca. 1400–1600. Example reference image, with the corresponding correct match identified within the archive.

E-rara Image Matchmaking API

A FastAPI-based service for searching and retrieving historical images from the e-rara digital library using bibliographic criteria and optional reference images.

Overview

This API provides an IMAGE_MATCHMAKING operation that allows clients to:

Search e-rara's collection using metadata filters (author, title, place, publisher, date range)
Upload reference images for similarity matching
Receive both thumbnail and full-resolution image URLs
Handle large result sets asynchronously with job polling or SSE streaming
Smart page selection to avoid book covers and prioritize content pages

Features

Dual input support - Accepts both JSON and multipart form-data
Smart page filtering - Automatically skips cover pages and selects content pages
IIIF image URLs - Returns proper thumbnail and full-resolution URLs
Manifest integration - Expands records to individual pages with full page ID arrays
Async processing - Background jobs for large result sets (>100 images)
Streaming support - Server-Sent Events (SSE) for real-time progress
Comprehensive validation - Input validation, image URL verification, error handling
Rich metadata - Returns record IDs, page counts, manifest URLs, and complete page arrays
Flexible field mapping - Supports various field name formats (e.g., "Printer / Publisher", "printer/publisher")

Quick Start

Prerequisites

pip install fastapi uvicorn requests beautifulsoup4 python-multipart pydantic

Running the API

uvicorn image_matchmaking_api:app --reload

The API will be available at:

Base URL: http://127.0.0.1:8000
Interactive docs: http://127.0.0.1:8000/docs
OpenAPI spec: http://127.0.0.1:8000/openapi.json

Recent Updates (v2.0)

🎯 Smart Page Selection

Automatic cover filtering: No more book covers! API now selects content pages by default
Intelligent page targeting: Selects pages from middle content sections
Configurable strategies: Choose between content, first page, or random selection

📝 JSON API Support

Modern JSON requests: Clean, structured requests instead of form data
Flexible field mapping: Supports various field name formats
Better validation: Pydantic models for request validation

🔧 Enhanced Criteria Processing

Fixed field mapping: "Printer / Publisher" and similar variations now work correctly
Case-insensitive matching: Field names are normalized automatically
Multiple format support: Handle different naming conventions seamlessly

API Endpoints

POST `/api/v1/matchmaking/images/search`

Main search endpoint supporting both JSON and form-data input.

JSON Request Format (Recommended)

{
  "operation": "IMAGE_MATCHMAKING",
  "criteria": [
    {
      "field": "Printer / Publisher",
      "value": "Bern*"
    },
    {
      "field": "Place", 
      "value": "Basel"
    }
  ],
  "from_date": "1600",
  "until_date": "1620",
  "maxResults": 10,
  "avoid_covers": true,
  "page_selection": "content"
}

New JSON Parameters

avoid_covers (boolean, default: true): Skip book covers and select content pages
page_selection (string, default: "content"): Page selection strategy
- "content": Smart content page selection (skips covers)
- "first": Original behavior (first page, likely cover)
- "random": Random page selection

Performance Parameters

validate_images (boolean, default: true): Verify image accessibility
- true: Ensures all returned images are accessible (slower but more reliable)
- false: Skip validation for 30-50% speed improvement
max_workers (integer, default: 4): Concurrent processing threads for multi-record requests

POST `/api/v1/matchmaking/images/search/form`

Legacy form-data endpoint for backward compatibility.

Required Fields

operation (string): Must be "IMAGE_MATCHMAKING"
projectId (string): Project identifier
agentId (string): Agent identifier

Optional Fields

conversationId (string): UUID for traceability
from_date (string): Start year (YYYY format)
until_date (string): End year (YYYY format)
maxResults (integer): Maximum number of results
pageSize (integer): Page size for pagination
includeMetadata (boolean): Include metadata (default: true)
responseFormat (string): "json" or "stream"
locale (string): Language preference
criteria (array): Search criteria in format "field:value:operator"
uploadedImage (files): Reference images for similarity matching

Synchronous Response (≤100 results)

{
  "images": [
    {
      "recordId": "6100663",
      "pageId": "6100665",
      "thumbnailUrl": "https://www.e-rara.ch/i3f/v21/6100665/full/,150/0/default.jpg",
      "fullImageUrl": "https://www.e-rara.ch/i3f/v21/6100665/full/full/0/default.jpg",
      "pageCount": 372,
      "pageIds": ["6100665", "6100666", "6100667", "..."],
      "manifest": "https://www.e-rara.ch/i3f/v21/6100663/manifest"
    }
  ],
  "count": 1
}

Async Response (>100 results)

{
  "jobId": "uuid-string",
  "status": "pending"
}

GET `/api/v1/matchmaking/images/results`

Poll for async job results.

Parameters:

jobId (required): Job identifier
pageToken (optional): Pagination token

GET `/api/v1/matchmaking/images/stream`

Server-Sent Events stream for async job progress.

Parameters:

jobId (required): Job identifier

Search Criteria

Supported Fields

The API supports flexible field name formats for better usability:

Title: "Title", "title"
Author: "Author", "Creator", "author", "creator"
Place: "Place", "Publication Place", "Origin Place", "place"
Publisher: "Publisher", "Printer", "Printer / Publisher", "printer/publisher"

Smart Page Selection

NEW: The API now intelligently selects content pages instead of covers:

Default behavior: Automatically skips first 2-3 pages (covers, title pages)
Content targeting: Selects pages from the middle content section
Adaptive logic: Adjusts skip amounts based on document length
Short document handling: For documents ≤3 pages, returns first page

Example impact:

100-page book: Skips pages 1-3, selects around page 35-40
20-page pamphlet: Skips page 1-2, selects around page 8
Result: ~80% reduction in cover images returned

Date Filtering

from_date - Start year (e.g., "1600")
until_date - End year (e.g., "1700")
Automatic splitting for ranges >400 years

Error Handling

HTTP Status Codes

200 - Success
400 - Validation error
404 - Job not found
413 - Payload too large
415 - Unsupported media type
422 - Unsupported field
429 - Rate limit exceeded
500 - Internal server error

Error Response Format

{
  "error": "VALIDATION_ERROR",
  "details": [
    {
      "field": "from_date",
      "message": "Year must be 4 digits"
    }
  ]
}

Usage Examples

JSON Request (Recommended)

curl -X POST "http://127.0.0.1:8000/api/v1/matchmaking/images/search" \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "IMAGE_MATCHMAKING",
    "criteria": [
      {
        "field": "Printer / Publisher",
        "value": "Bern*"
      }
    ],
    "from_date": "1600",
    "until_date": "1620",
    "maxResults": 5,
    "avoid_covers": true
  }'

Form Data Request (Legacy)

curl -X POST "http://127.0.0.1:8000/api/v1/matchmaking/images/search/form" \
  -F "operation=IMAGE_MATCHMAKING" \
  -F "projectId=demo" \
  -F "agentId=demo" \
  -F "from_date=1600" \
  -F "until_date=1650" \
  -F "maxResults=5"

Search with Multiple Criteria

curl -X POST "http://127.0.0.1:8000/api/v1/matchmaking/images/search" \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "IMAGE_MATCHMAKING",
    "criteria": [
      {
        "field": "Title",
        "value": "Historia*"
      },
      {
        "field": "Place", 
        "value": "Basel"
      }
    ],
    "from_date": "1600",
    "until_date": "1700",
    "maxResults": 10,
    "page_selection": "content"
  }'

JavaScript Frontend Integration

async function searchImages() {
  const requestData = {
    operation: 'IMAGE_MATCHMAKING',
    criteria: [
      {
        field: 'Printer / Publisher',
        value: 'Bern*'
      }
    ],
    from_date: '1600',
    until_date: '1700',
    maxResults: 10,
    avoid_covers: true,
    page_selection: 'content'
  };

  const response = await fetch('/api/v1/matchmaking/images/search', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(requestData)
  });

  const data = await response.json();
  
  if (data.images) {
    // Synchronous results
    renderImages(data.images);
  } else if (data.jobId) {
    // Async job - poll for results
    pollJobResults(data.jobId);
  }
}

function renderImages(images) {
  images.forEach(img => {
    // Show thumbnail first
    const thumbnail = document.createElement('img');
    thumbnail.src = img.thumbnailUrl;
    thumbnail.onclick = () => {
      // Load full image on click
      thumbnail.src = img.fullImageUrl;
    };
    document.body.appendChild(thumbnail);
  });
}

Image URL Patterns

IIIF URL Structure

Thumbnail: https://www.e-rara.ch/i3f/v21/{pageId}/full/,150/0/default.jpg
Full size: https://www.e-rara.ch/i3f/v21/{pageId}/full/full/0/default.jpg
Custom size: https://www.e-rara.ch/i3f/v21/{pageId}/full/,{height}/0/default.jpg

Size Options

full - Original dimensions
,150 - Height constrained to 150px
300, - Width constrained to 300px
!300,300 - Fit within 300×300 box
pct:25 - 25% of original size

Development

Project Structure

├── image_matchmaking_api.py    # Main FastAPI application
├── e_rara_id_fetcher.py       # E-rara search logic
├── e_rara_image_downloader_hack.py  # IIIF manifest processing
├── README.md                  # This file
└── read.md                   # Original API specification

Dependencies

FastAPI - Web framework
Uvicorn - ASGI server
Requests - HTTP client
BeautifulSoup4 - HTML/XML parsing
python-multipart - Form data handling

Adding Features

To extend the API:

New search criteria: Update parse_criteria() function
Image processing: Integrate with vision models in process_job()
Caching: Add Redis/memory cache for manifest data
Authentication: Add JWT/API key middleware
Rate limiting: Implement request throttling

Testing

# Start the development server
uvicorn image_matchmaking_api:app --reload --log-level debug

# Test JSON endpoint with content page selection
curl -X POST "http://127.0.0.1:8000/api/v1/matchmaking/images/search" \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "IMAGE_MATCHMAKING",
    "criteria": [
      {
        "field": "Place",
        "value": "Basel*"
      }
    ],
    "from_date": "1600",
    "until_date": "1610",
    "maxResults": 3,
    "avoid_covers": true,
    "page_selection": "content"
  }'

# Test legacy form endpoint
curl -X POST "http://127.0.0.1:8000/api/v1/matchmaking/images/search/form" \
  -F "operation=IMAGE_MATCHMAKING" \
  -F "projectId=test" \
  -F "agentId=test" \
  -F "from_date=1600" \
  -F "until_date=1610" \
  -F "maxResults=2"

🚀 Performance Optimizations (v2.0)

The latest version includes comprehensive performance improvements based on a systematic 3-week optimization plan:

✅ Week 1: Intelligent Caching Layer

Manifest caching: LRU cache (1000 items) for IIIF manifest data - eliminates repeated API calls
Image validation caching: LRU cache (2000 items) for image accessibility checks
Cache management: Monitor hit rates and clear caches via API endpoints
Impact: 80-90% faster performance for subsequent requests

✅ Week 2: Concurrent Processing

Parallel record processing: ThreadPoolExecutor for multi-record requests
Configurable concurrency: Adjustable max_workers (default: 4) based on system resources
Smart batching: Optimal performance scaling for both single and bulk requests
Impact: 3-5x faster processing for multi-record searches

✅ Week 3: Optional Image Validation

Configurable validation: Skip image accessibility checks for speed (validate_images: false)
Smart defaults: Validation enabled by default to ensure image quality
Performance monitoring: Track validation impact and cache efficiency
Impact: 30-50% speed improvement when validation is disabled

Additional Performance Features

Smart Page Selection: Automatically skips book covers - 50-80% better image relevance
Enhanced Field Mapping: Case-insensitive matching reduces search failures
Robust Error Handling: Prevents cascading failures in bulk operations

Performance Monitoring

Check current performance status:

# Cache statistics
curl http://localhost:8000/api/v1/cache/stats

# Performance configuration  
curl http://localhost:8000/api/v1/performance/config

# Clear caches if needed
curl -X POST http://localhost:8000/api/v1/cache/clear

Testing Performance Improvements

Use the included test script:

python3 test_performance.py

Or the quick test launcher:

./quick_test.sh

Performance Impact Summary

First-time requests: 30-50% faster with optional validation disabled
Cached requests: 80-90% faster with manifest caching
Multi-record requests: 3-5x faster with concurrent processing
Image relevance: 50-80% improvement through smart page selection

Contributing

Follow the existing code structure and naming conventions
Add logging for new features using the configured logger
Include error handling and validation for new endpoints
Update this README for any API changes

License

This project interfaces with e-rara.ch, a service of the ETH Library. Please respect their terms of service and usage guidelines.

Preview of external content.

👋 Contact 💻 Source

Hackathons full of ideas, collaboration, and innovation are based on the premise of keeping the experience safe, inclusive, and respectful for everyone. We follow a clear Code of Conduct and support the Universal Declaration of Human Rights. Harassment or discrimination of any kind won't be tolerated—this applies to all staff, participants, coaches, visitors and sponsors. Please take a moment to review the full guidelines.

The contents of this website, unless otherwise stated, are licensed under a Creative Commons Attribution 4.0 International License. The application that powers this site is available under the MIT license.

Previous
Hackathon Bern
Next project

Hackathon Bern