Picsha AI
Developer Documentation

Picsha AI Image Delivery Primer

This document provides a comprehensive overview of how the Picsha AI platform handles the entire lifecycle of an asset, from the moment of upload (ingestion) to its final optimized delivery to end-users globally via our CDN.

1. The Ingestion Phase (What We Do on Upload)

When an asset is uploaded to Picsha, it triggers an asynchronous job in the picsha-ai-ingest Fargate worker. Because web browsers natively support only a limited set of image and document formats, the ingestion phase is primarily responsible for creating web-compatible "proxies" and optimized samples of the original file.

AI-Powered Extraction & Analysis

Before rendering previews, the asset undergoes deep metadata extraction and AI analysis:

  • Metadata via ExifTool: A persistent exiftool-vendored background process pool accurately extracts EXIF data, embedded color profiles, and true MIME types (using magic-byte detection) even if the file lacks an extension.
  • Object & Facial Analysis (AWS Rekognition): To ensure API compatibility, the ingest worker temporarily scales down large files to compliant JPEGs before sending them to AWS Rekognition (sketched after this list). This extracts object labels and maps detected faces into specific Rekognition Collections.
  • Scene Analysis (Amazon Titan): For image embeddings, we use the amazon.titan-embed-image-v1 multimodal model. This generates rich vector embeddings used by OpenSearch for semantic queries.
  • AI Summaries (Claude Sonnet): We pass the extracted textual content (or proxy image) to Anthropic's Claude Sonnet (anthropic.claude-sonnet-4-6 via Bedrock) to generate robust semantic summaries.
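
For illustration, here is a minimal sketch of that Rekognition pre-scaling step. It assumes the AWS SDK v3 Rekognition client and Sharp; the 2048px cap, JPEG quality, and label thresholds are illustrative assumptions, not the ingest worker's actual values.

    import sharp from "sharp";
    import {
      RekognitionClient,
      DetectLabelsCommand,
    } from "@aws-sdk/client-rekognition";

    const rekognition = new RekognitionClient({});

    // Rekognition only accepts JPEG/PNG image bytes and caps their size,
    // so large or exotic originals are first down-scaled to a compliant JPEG.
    async function detectLabels(original: Buffer) {
      const compliantJpeg = await sharp(original)
        .rotate() // honor EXIF orientation before analysis
        .resize({ width: 2048, withoutEnlargement: true })
        .jpeg({ quality: 85 })
        .toBuffer();

      const { Labels } = await rekognition.send(
        new DetectLabelsCommand({
          Image: { Bytes: compliantJpeg },
          MaxLabels: 25,
          MinConfidence: 80,
        })
      );
      return Labels ?? [];
    }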

Making Samples of Photos

We process images using Sharp or ImageMagick, depending on their complexity. If the "Pre-render Sizes" toggle is enabled during ingest (passed as "pre_render_sizes": true in the API payload), the ingest worker immediately generates a standard set of derivative samples (a sketch follows the list):

  1. proxy.jpg (The Universal Proxy): For "complex" media formats that browsers cannot render directly (e.g., HEIC/HEIF from iPhones, RAW files like CR2/NEF, and design files like PSD, AI, EPS), the ingest pipeline uses ImageMagick to flatten and decode the file into a high-quality JPEG. (Note: While legacy systems explicitly invoked Ghostscript for EPS parsing, the modern ingestion pipeline calls ImageMagick, which seamlessly delegates EPS and PostScript rendering to the underlying Ghostscript engine). This proxy acts as the source of truth for all future delivery transformations.
  2. optimized.webp: A web-ready, highly compressed delivery format optimized for fast loading on standard screens.
  3. thumb.webp: A lightweight, 150px preview used extensively in the platform's grid layouts and UI.
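
A minimal sketch of this fan-out, assuming ImageMagick is available as the magick CLI and that format complexity was already determined by the upstream MIME-detection step; the quality settings and flatten flags are illustrative rather than the pipeline's exact invocation.

    import sharp from "sharp";
    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const run = promisify(execFile);

    // Generate the standard derivative set alongside the original.
    async function preRenderSizes(inputPath: string, dir: string, isComplex: boolean) {
      let source = inputPath;

      if (isComplex) {
        // HEIC/RAW/PSD/AI/EPS: flatten and decode to the universal JPEG
        // proxy via ImageMagick (EPS rendering is delegated to Ghostscript).
        source = `${dir}/proxy.jpg`;
        await run("magick", [`${inputPath}[0]`, "-flatten", "-quality", "92", source]);
      }

      // Web-ready delivery rendition and the lightweight grid thumbnail.
      await sharp(source).webp({ quality: 80 }).toFile(`${dir}/optimized.webp`);
      await sharp(source).resize({ width: 150 }).webp().toFile(`${dir}/thumb.webp`);
    }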

Creating PDF Representations of Docs and Markdown Files

For non-image text assets, the goal is to generate a visual, browser-viewable "poster" proxy so users can preview the document natively in the UI without downloading it. The DocumentService handles this conversion:

  • Word Documents (.docx, .doc, .rtf): We utilize a headless LibreOffice container (soffice --headless --convert-to pdf) to faithfully convert the document into a PDF representation.
  • Markdown Files (.md): Markdown files undergo a specialized pipeline. First, the raw text is parsed using marked into styled HTML, complete with GitHub-flavored typography, code blocks, and layout wrappers. Once the HTML document is generated, it is passed into LibreOffice to convert the styled HTML into a polished, visual PDF proxy.

These PDF proxies are ultimately surfaced to the frontend as the visual representation (poster.jpg) of the textual asset.
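
To make the Markdown branch concrete, here is a minimal sketch assuming marked and a headless LibreOffice (soffice) binary on the PATH; the inline stylesheet and file paths are illustrative stand-ins for the DocumentService's real templates.

    import { writeFile } from "node:fs/promises";
    import { execFile } from "node:child_process";
    import { promisify } from "node:util";
    import { marked } from "marked";

    const run = promisify(execFile);

    // Parse raw Markdown into styled HTML, then hand the HTML document to
    // headless LibreOffice to produce the visual PDF proxy.
    async function markdownToPdf(markdown: string, outDir: string): Promise<string> {
      const body = await marked.parse(markdown);
      const html = `<!doctype html><html><head><style>
        body { font-family: sans-serif; max-width: 52rem; margin: 2rem auto; }
        pre, code { font-family: monospace; background: #f6f8fa; }
      </style></head><body>${body}</body></html>`;

      const htmlPath = `${outDir}/document.html`;
      await writeFile(htmlPath, html);
      await run("soffice", ["--headless", "--convert-to", "pdf", "--outdir", outDir, htmlPath]);
      return `${outDir}/document.pdf`;
    }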


2. The Delivery Phase (CloudFront & Lambda)

Asset delivery is managed by the picsha-ai-image-delivery serverless service. It employs an AWS API Gateway + AWS Lambda architecture sitting behind a globally distributed Amazon CloudFront CDN (cdn.picsha.ai).

Dynamic Sourcing and Transformation in Lambda

When a request reaches the Lambda function (e.g., /render/{assetId}?wd=500&ht=500&crop=true), the imageDeliveryService performs the following steps (a condensed handler sketch follows the list):

  1. Source Determination: It decides which S3 object to pull based on the asset type. For standard images, it pulls the original. For complex images (HEIC/RAW), it dynamically swaps the source to the assets/${id}/proxy.jpg generated during ingestion. For documents and videos, it pulls the poster.jpg.
  2. On-the-Fly Processing: Using the Sharp library, the Lambda applies all requested query parameters. This includes resizing, "face gravity" cropping, blurring, watermarking, text compositing, and even AI-driven generative editing via the MIMI orchestrator.
  3. Response: The final transformed image is returned synchronously to the client as a Base64-encoded binary response (or a 302 Redirect to S3 if the original file is requested without transformations).
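
A condensed sketch of such a handler, assuming API Gateway's Lambda proxy integration and the AWS SDK v3. lookupSourceKey, the bucket name, and the always-WebP output are simplifying assumptions; the real service supports many more parameters (face gravity, watermarking, text compositing, MIMI edits).

    import sharp from "sharp";
    import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
    import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

    const s3 = new S3Client({});
    const BUCKET = process.env.ASSET_BUCKET ?? "picsha-assets"; // illustrative

    export async function handler(event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> {
      const id = event.pathParameters?.assetId ?? "";
      const q = event.queryStringParameters ?? {};

      // 1. Source determination: original for simple images, proxy.jpg for
      //    complex formats, poster.jpg for documents and videos.
      const key = await lookupSourceKey(id);
      const obj = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: key }));
      const input = Buffer.from(await obj.Body!.transformToByteArray());

      // 2. On-the-fly processing with Sharp, driven by query parameters.
      let img = sharp(input);
      if (q.wd || q.ht) {
        img = img.resize({
          width: q.wd ? Number(q.wd) : undefined,
          height: q.ht ? Number(q.ht) : undefined,
          fit: q.crop === "true" ? "cover" : "inside",
        });
      }
      if (q.blur) img = img.blur(Number(q.blur));

      // 3. Synchronous Base64-encoded binary response.
      const out = await img.webp().toBuffer();
      return {
        statusCode: 200,
        headers: { "Content-Type": "image/webp" },
        isBase64Encoded: true,
        body: out.toString("base64"),
      };
    }

    // Placeholder for the real asset-type lookup described in step 1.
    async function lookupSourceKey(id: string): Promise<string> {
      return `assets/${id}/proxy.jpg`; // complex-image case
    }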

Sampling in the CDN (Caching Strategy)

To ensure the Lambda function is not overwhelmed and to serve end-users with minimal latency, we rely heavily on CloudFront edge caching.

Where and How We Sample:

  • CloudFront is configured to forward Query Strings (ForwardedValues: QueryString: true).
  • In the context of the CDN, sampling refers to caching the unique permutations of the image transformations. The "cache key" for CloudFront is the exact combination of the URL path AND the specific query parameters requested.
  • Example: A request for ?wd=800 is sampled and cached separately from a request for ?wd=800&blur=10.
  • TTL: Every unique sample generated by the Lambda is cached at the CloudFront edge locations for 86,400 seconds (1 day; DefaultTTL: 86400) and marked as immutable in its Cache-Control headers (see the snippet below).
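
Purely as illustration of that caching contract, here are the headers a rendered sample might carry, plus two URLs that occupy distinct cache keys:

    // Each unique path + query-string permutation is its own CloudFront
    // cache entry, so the Lambda stamps every rendered sample with a
    // one-day, immutable policy (mirroring DefaultTTL: 86400).
    const ONE_DAY_SECONDS = 86_400;
    const cacheHeaders = {
      "Cache-Control": `public, max-age=${ONE_DAY_SECONDS}, immutable`,
    };

    // Rendered and cached independently of each other:
    //   https://cdn.picsha.ai/render/abc123?wd=800
    //   https://cdn.picsha.ai/render/abc123?wd=800&blur=10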

This architecture ensures that the expensive compute required for complex transformations (or retrieving proxy.jpg files) runs only on a cache miss; every subsequent request for the same parameter combination is served directly from the CloudFront edge node closest to the user.