Picsha AI
Developer Documentation

The Picsha AI Asset Pipeline

This document outlines the full lifecycle of an asset on the Picsha AI platform, from the moment it reaches our ingestion worker to its final optimized delivery to end users worldwide via our CDN.

1. The Ingestion Phase (Processing & AI)

The moment an asset lands in our S3 storage ingress, Picsha AI fires off an asynchronous task to our Queue Worker (picsha-ai-ingest). This isolated Fargate task performs the heavy lifting needed to sanitize assets for web delivery and extract deep semantic context. A minimal sketch of that hand-off follows.
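
One possible shape for the relay, with the queue URL and message body as illustrative assumptions rather than the production contract:

```typescript
// Minimal sketch of the ingest hand-off: an S3 event notification is relayed
// onto the queue consumed by the picsha-ai-ingest worker.
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
import type { S3Event } from "aws-lambda";

const sqs = new SQSClient({});

export async function handler(event: S3Event): Promise<void> {
  for (const record of event.Records) {
    await sqs.send(
      new SendMessageCommand({
        QueueUrl: process.env.INGEST_QUEUE_URL!, // hypothetical: the picsha-ai-ingest queue
        MessageBody: JSON.stringify({
          bucket: record.s3.bucket.name,
          key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
        }),
      })
    );
  }
}
```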

Phase 1A: Security & Pre-Processing

  1. Magic Byte Analysis: We never trust file extensions or MIME types supplied by web clients. A persistent exiftool-vendored background process inspects the literal file header (magic bytes) to determine the true format, and also extracts EXIF data and embedded color profiles (a sketch follows this list).
  2. Generative Object Removal (Optional): If the ingress API call included generative_edit.remove parameters, our system runs an Amazon Titan object-removal pass immediately, replacing the source asset on S3 with the cleanly generated version before any downstream indexing (also sketched below).
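
A minimal sketch of the magic-byte check, assuming the exiftool-vendored npm package and an illustrative format allow-list:

```typescript
// Format detection via the persistent exiftool-vendored process: we trust the
// FileType/MIMEType that ExifTool derives from the file header (magic bytes),
// never the client-supplied extension. The allow-list is an assumption.
import { exiftool } from "exiftool-vendored";

const ALLOWED = new Set(["JPEG", "PNG", "WEBP", "HEIC", "PDF", "MP4", "MOV"]);

export async function inspectUpload(path: string) {
  const tags = await exiftool.read(path);
  if (!tags.FileType || !ALLOWED.has(tags.FileType)) {
    throw new Error(`Rejected upload: detected type ${tags.FileType ?? "unknown"}`);
  }
  return {
    format: tags.FileType,   // derived from magic bytes, not the extension
    mimeType: tags.MIMEType,
    width: tags.ImageWidth,
    height: tags.ImageHeight,
    gps: tags.GPSLatitude != null
      ? { lat: tags.GPSLatitude, lon: tags.GPSLongitude }
      : undefined,
  };
}
// Call `await exiftool.end()` on worker shutdown to stop the background process.
```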
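And a hedged sketch of the removal step, assuming it maps onto a Titan Image Generator G1 INPAINTING request on Bedrock; the prompts and response handling are illustrative, not the production contract:

```typescript
// Hypothetical generative-removal call via Bedrock. The field names follow the
// Titan Image Generator G1 inpainting request schema; prompts are placeholders.
import {
  BedrockRuntimeClient,
  InvokeModelCommand,
} from "@aws-sdk/client-bedrock-runtime";

const bedrock = new BedrockRuntimeClient({});

export async function removeObject(imageB64: string, target: string): Promise<Buffer> {
  const res = await bedrock.send(
    new InvokeModelCommand({
      modelId: "amazon.titan-image-generator-v1",
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({
        taskType: "INPAINTING",
        inPaintingParams: {
          image: imageB64,
          maskPrompt: target, // e.g. the generative_edit.remove parameter value
          text: "remove the masked object and fill with the surrounding background",
        },
        imageGenerationConfig: { numberOfImages: 1 },
      }),
    })
  );
  const { images } = JSON.parse(new TextDecoder().decode(res.body));
  return Buffer.from(images[0], "base64"); // written back over the S3 source
}
```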

Phase 1B: Derivative Generation

Because web browsers natively support only a limited set of formats, we generate derivatives tailored to each asset class:

  • 📸 Photos & Complex Images

    • Web Images (JPEG/PNG): We generate an optimized .webp delivery version and an ultra-fast 150px grid thumbnail (thumb.webp); see the Sharp sketch after this list.
    • Complex Images (HEIC, RAW, PSD, EPS, AI): Browsers cannot render these, so we decode and flatten them in an in-memory ImageMagick/LibRaw pipeline to produce a universally readable, high-quality proxy.jpg. (ImageMagick delegates EPS/PostScript rendering to Ghostscript.) The proxy feeds into the standard optimization pipeline to generate .webp variants and acts as the source of truth for all future delivery transformations.
  • 📄 Document Handling (PDFs, DOCX, TXT, MD)

    • Word & Markdown to PDF: We use a headless LibreOffice container (soffice --headless) to faithfully convert documents into standard PDFs; a sketch of the invocation follows this list. For Markdown (.md), we first render it into styled HTML with GitHub-flavored typography before converting it via LibreOffice.
    • Poster Extraction: The first page of the document is rendered to a high-res poster.jpg to serve as a visual cover image.
    • Text Extraction: The raw text within the document is extracted and buffered in memory for the AI phase.
  • 🎥 Video & Audio Handling

    • HLS Streaming: If the adaptive_stream flag is set, the video triggers a dedicated AWS Elemental MediaConvert workflow that transcodes large mp4/mov files into adaptive .m3u8 renditions (1080p, 720p, 480p) for seamless client buffering; a job-submission sketch follows this list.
    • Cover Image: A snapshot from the video timeline is extracted and saved as the asset's visual poster.jpg.
    • Audio Extraction: The internal audio track is extracted as a temporary MP3 file for speech processing.
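
A minimal sketch of the web-image derivative step using Sharp; the quality settings and the square thumbnail crop are assumptions:

```typescript
// Derivative generation for JPEG/PNG sources: an optimized .webp plus the
// 150px grid thumbnail (thumb.webp).
import sharp from "sharp";

export async function deriveWebImages(original: Buffer) {
  const optimized = await sharp(original)
    .rotate()                           // honor EXIF orientation
    .webp({ quality: 80 })
    .toBuffer();

  const thumb = await sharp(original)
    .rotate()
    .resize(150, 150, { fit: "cover" }) // square 150px grid thumbnail (assumed square)
    .webp({ quality: 60 })
    .toBuffer();

  return { optimized, thumb };          // uploaded as the .webp / thumb.webp variants
}
```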
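The document conversion reduces to a single headless LibreOffice invocation; this sketch assumes soffice is on the container's PATH:

```typescript
// DOCX-to-PDF via the headless LibreOffice container. Flags are the standard
// soffice CLI; paths are caller-supplied.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

export async function docxToPdf(inputPath: string, outDir: string): Promise<void> {
  await run("soffice", [
    "--headless",
    "--convert-to", "pdf",
    "--outdir", outDir,
    inputPath,
  ]);
}
```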
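For the adaptive_stream path, a hedged job-submission sketch; the job template name, role ARN, and paths are assumptions (the HLS ladder itself would live in the template):

```typescript
// Submitting an AWS Elemental MediaConvert job from a template that defines
// the 1080p/720p/480p HLS outputs.
import {
  MediaConvertClient,
  CreateJobCommand,
} from "@aws-sdk/client-mediaconvert";

const mc = new MediaConvertClient({});

export async function startHlsTranscode(assetId: string, sourceUri: string) {
  return mc.send(
    new CreateJobCommand({
      Role: process.env.MEDIACONVERT_ROLE_ARN!, // IAM role MediaConvert assumes
      JobTemplate: "picsha-hls-ladder",         // hypothetical template with the .m3u8 outputs
      Settings: {
        Inputs: [{ FileInput: sourceUri }],     // e.g. s3://bucket/assets/<id>/original.mp4
      },
      UserMetadata: { assetId },
    })
  );
}
```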

Phase 1C: Artificial Intelligence

With clean derivatives in hand, we invoke our multimodal LLM architecture to build the search index:

  1. Vision Analysis (Rekognition & Claude): Assets with an image derivative (photos, document posters, video posters) are pushed to AWS Rekognition for bounding-box facial recognition (mapping into specific Collections) and content moderation. In parallel, we pass the proxy to Anthropic's Claude 3 Haiku on Amazon Bedrock to visually read and summarize the content (see the sketch after this list).
  2. Transcript & Document AI (Sonnet): Audio is fed into Amazon Transcribe for full textual transcripts. The transcript (or extracted document text) is then forwarded to Claude Sonnet via Bedrock, which condenses massive multi-page payloads into actionable summary paragraphs.
  3. The Unified Multimodal Embedding (Titan): The "holy grail" of our ingest architecture. We fuse the proxy image and the generated text summary/transcript through the amazon.titan-embed-image-v1 multimodal model, producing a single 1024-dimensional vector that represents both the visual pixels and the textual context in the same vector space (sketched after this list).
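
A sketch of the vision-summary call, passing the proxy image to Claude 3 Haiku on Bedrock with the Anthropic Messages schema; the prompt is illustrative:

```typescript
// Vision read-and-summarize via Bedrock's InvokeModel with the Anthropic
// Messages request format.
import {
  BedrockRuntimeClient,
  InvokeModelCommand,
} from "@aws-sdk/client-bedrock-runtime";

const bedrock = new BedrockRuntimeClient({});

export async function summarizeImage(jpegB64: string): Promise<string> {
  const res = await bedrock.send(
    new InvokeModelCommand({
      modelId: "anthropic.claude-3-haiku-20240307-v1:0",
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({
        anthropic_version: "bedrock-2023-05-31",
        max_tokens: 300,
        messages: [
          {
            role: "user",
            content: [
              { type: "image", source: { type: "base64", media_type: "image/jpeg", data: jpegB64 } },
              { type: "text", text: "Read and summarize this image for search indexing." }, // illustrative prompt
            ],
          },
        ],
      }),
    })
  );
  const body = JSON.parse(new TextDecoder().decode(res.body));
  return body.content[0].text;
}
```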
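And the unified embedding call, following the amazon.titan-embed-image-v1 request shape:

```typescript
// One call fuses pixels and text: the proxy image plus the summary/transcript
// produce a single 1024-dimensional multimodal vector.
import {
  BedrockRuntimeClient,
  InvokeModelCommand,
} from "@aws-sdk/client-bedrock-runtime";

const bedrock = new BedrockRuntimeClient({});

export async function embedAsset(proxyJpegB64: string, summary: string): Promise<number[]> {
  const res = await bedrock.send(
    new InvokeModelCommand({
      modelId: "amazon.titan-embed-image-v1",
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({
        inputImage: proxyJpegB64,
        inputText: summary,
        embeddingConfig: { outputEmbeddingLength: 1024 },
      }),
    })
  );
  const { embedding } = JSON.parse(new TextDecoder().decode(res.body));
  return embedding; // visual pixels and textual context share this vector space
}
```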

Phase 1D: Persistence

The final dimensions, EXIF data, GPS coordinates, textual transcripts, and multimodal vectors are written in parallel to Neon (PostgreSQL) for metadata management and to AWS OpenSearch for low-latency conversational vector search; a sketch of the vector index setup follows.
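
A sketch of the OpenSearch side of this step, assuming a k-NN index holds the Titan vectors; index and field names are assumptions:

```typescript
// Creating a k-NN-enabled index sized for the 1024-dimensional Titan vectors.
import { Client } from "@opensearch-project/opensearch";

const os = new Client({ node: process.env.OPENSEARCH_ENDPOINT! });

export async function ensureAssetIndex(): Promise<void> {
  await os.indices.create({
    index: "assets", // hypothetical index name
    body: {
      settings: { index: { knn: true } }, // enable the k-NN plugin
      mappings: {
        properties: {
          embedding:  { type: "knn_vector", dimension: 1024 },
          summary:    { type: "text" },
          transcript: { type: "text" },
          assetId:    { type: "keyword" },
        },
      },
    },
  });
}
```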


2. The Delivery Phase (CloudFront & Lambda)

Asset delivery is managed by the picsha-ai-image-delivery serverless service. It employs an AWS API Gateway + AWS Lambda architecture sitting behind a globally distributed Amazon CloudFront CDN (cdn.picsha.ai).

Dynamic Sourcing and Transformation

When a request reaches the Lambda function (e.g., /render/{assetId}?wd=500&ht=500&crop=true), the service performs the following steps:

  1. Source Determination: It intelligently decides which S3 object to pull. For standard images, it pulls the original. For complex images (HEIC/RAW), it dynamically swaps the source to the assets/${id}/proxy.jpg generated during ingestion. For documents and videos, it pulls the poster.jpg.
  2. On-the-Fly Processing: Using the Sharp library, the Lambda applies all requested query parameters: resizing, "face gravity" cropping, blurring, watermarking, text compositing, and even AI-driven generative editing (such as background removal) via the MIMI orchestrator and AWS Bedrock.
  3. Response: The final transformed image is returned synchronously to the client as a binary response (Base64-encoded in the Lambda proxy payload), or as a 302 redirect to S3 when the original file is requested without transformations. A condensed handler sketch follows this list.
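
A condensed sketch of that flow; the source lookup, parameter handling, and output format are simplified assumptions:

```typescript
// Delivery Lambda in miniature: fetch the source object, apply the documented
// wd/ht/crop parameters with Sharp, return the binary payload to API Gateway.
import sharp from "sharp";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import type { APIGatewayProxyEventV2, APIGatewayProxyResultV2 } from "aws-lambda";

const s3 = new S3Client({});

export async function handler(event: APIGatewayProxyEventV2): Promise<APIGatewayProxyResultV2> {
  const assetId = event.pathParameters?.assetId ?? "";
  const q = event.queryStringParameters ?? {};

  // Source determination (simplified): complex images resolve to proxy.jpg,
  // documents/videos to poster.jpg; the real lookup consults asset metadata.
  const key = `assets/${assetId}/proxy.jpg`;

  const obj = await s3.send(new GetObjectCommand({ Bucket: process.env.ASSET_BUCKET!, Key: key }));
  const source = Buffer.from(await obj.Body!.transformToByteArray());

  let img = sharp(source);
  if (q.wd || q.ht) {
    img = img.resize(
      q.wd ? Number(q.wd) : undefined,
      q.ht ? Number(q.ht) : undefined,
      { fit: q.crop === "true" ? "cover" : "inside" }
    );
  }
  const out = await img.webp({ quality: 80 }).toBuffer();

  return {
    statusCode: 200,
    isBase64Encoded: true, // API Gateway decodes this back to binary for the client
    headers: {
      "Content-Type": "image/webp",
      "Cache-Control": "public, max-age=86400, immutable", // matches the CDN TTL below
    },
    body: out.toString("base64"),
  };
}
```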

Sampling in the CDN (Caching Strategy)

To ensure the Lambda function is not overwhelmed and to provide sub-millisecond response times, we rely heavily on CloudFront edge caching.

Where and How We Sample:

  • CloudFront is configured to forward Query Strings (ForwardedValues: QueryString: true).
  • In the context of the CDN, sampling refers to caching the unique permutations of the image transformations. The "cache key" for CloudFront is the exact combination of the URL path AND the specific query parameters requested.
  • Example: A request for ?wd=800 is sampled and cached separately from a request for ?wd=800&blur=10.
  • TTL: Every unique sample generated by the Lambda is cached at CloudFront edge locations for 86,400 seconds (DefaultTTL: 86400, i.e. one day) and marked immutable in its Cache-Control headers; an illustrative cache-policy sketch follows this list.
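
Expressed in AWS CDK terms for illustration (the production stack may use raw CloudFormation ForwardedValues, as quoted above), the cache policy looks roughly like:

```typescript
// Query strings become part of the cache key, and each unique permutation is
// held at the edge for one day. Construct names are assumptions.
import { Duration } from "aws-cdk-lib";
import * as cloudfront from "aws-cdk-lib/aws-cloudfront";
import { Construct } from "constructs";

export function renderCachePolicy(scope: Construct): cloudfront.CachePolicy {
  return new cloudfront.CachePolicy(scope, "RenderCachePolicy", {
    defaultTtl: Duration.seconds(86400), // DefaultTTL: 86400 (1 day)
    queryStringBehavior: cloudfront.CacheQueryStringBehavior.all(), // wd/ht/crop/... form the cache key
    headerBehavior: cloudfront.CacheHeaderBehavior.none(),
    cookieBehavior: cloudfront.CacheCookieBehavior.none(),
  });
}
```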

This architecture guarantees that the expensive compute required for complex transformations happens only once per unique parameter combination; thereafter the result is served directly from the CDN edge node closest to the user.