The Picsha AI Asset Pipeline
This document outlines the full lifecycle of an asset on the Picsha AI platform—from the moment it hits our ingestion worker to its final optimized delivery to end-users globally via our CDN.
1. The Ingestion Phase (Processing & AI)
The moment an asset hits our S3 storage ingress, Picsha AI fires off an asynchronous task to our powerful Queue Worker (picsha-ai-ingest). This isolated Fargate node performs the heavy lifting necessary to sanitize assets for web delivery and extract deep semantic context.
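The hand-off described above can be sketched as a small payload builder. This is a hedged illustration only: the field names (`assetId`, `source`, `steps`) are assumptions, as the actual picsha-ai-ingest message schema is not specified in this document.

```python
import json

def build_ingest_task(bucket: str, key: str, asset_id: str) -> str:
    """Serialize the message handed to the picsha-ai-ingest queue worker.

    Field names are illustrative assumptions, not the real schema.
    """
    return json.dumps({
        "assetId": asset_id,
        "source": {"bucket": bucket, "key": key},
        "steps": ["sanitize", "derivatives", "ai", "persist"],
    })

payload = build_ingest_task("picsha-uploads", "raw/abc123.heic", "abc123")
```

Keeping the message a plain JSON envelope lets the Fargate worker scale independently of the upload path.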
Phase 1A: Security & Pre-Processing
- Magic Byte Analysis: We never trust file extensions or generic MIME types uploaded by web clients. Instead, a persistent `exiftool-vendored` background process inspects the literal file headers (magic bytes) to securely lock down the true format, while also extracting EXIF data and embedded color profiles.
- Generative Object Removal (Optional): If the ingress API call passed `generative_edit.remove` parameters, our system executes an Amazon Titan object-removal phase immediately. This physically replaces the source asset on S3 with the cleanly generated version prior to downstream indexing.
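The principle behind magic-byte analysis can be shown in a few lines. This is a minimal sketch, not the production code: the real pipeline delegates detection to the persistent `exiftool-vendored` process, and the table below covers only a handful of formats.

```python
# Map leading header bytes to the true MIME type; never trust the
# client-supplied extension. Only a few common signatures shown.
MAGIC = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF": "application/pdf",
    b"GIF8": "image/gif",
}

def sniff(header: bytes) -> str:
    for magic, mime in MAGIC.items():
        if header.startswith(magic):
            return mime
    # WebP is a RIFF container; the format tag sits at byte offset 8.
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "image/webp"
    return "application/octet-stream"
```

A file uploaded as `photo.jpg` but carrying a PNG signature is classified as `image/png`, which is exactly the mismatch this check exists to catch.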
Phase 1B: Derivative Generation
Because web browsers natively support only a limited set of formats, we process diverse asset classes dynamically:
- 📸 Photos & Complex Images
  - Web Images (JPEG/PNG): We generate a `.webp` optimized delivery version and an ultra-fast 150px grid thumbnail (`thumb.webp`).
  - Complex Images (HEIC, RAW, PSD, EPS, AI): Browsers cannot render these. We safely drop the image into an ImageMagick/LibRaw memory pipeline to decode and flatten it into a universally readable, high-quality `proxy.jpg`. (ImageMagick seamlessly delegates EPS/PostScript rendering to Ghostscript.) This proxy routes into the standard optimization pipeline to generate `.webp` variants and acts as the source of truth for all future delivery transformations.
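The branching above amounts to a routing table: web-native formats go straight to optimization, complex formats are flattened to a proxy first. The sketch below illustrates that decision; the function name and output filenames other than `proxy.jpg` and `thumb.webp` are assumptions.

```python
# Illustrative Phase 1B routing: which derivatives each image class
# produces. Not the actual Picsha AI source.
WEB_NATIVE = {"jpeg", "jpg", "png"}
COMPLEX = {"heic", "raw", "cr2", "nef", "psd", "eps", "ai"}

def derivative_plan(fmt: str) -> list[str]:
    fmt = fmt.lower()
    if fmt in WEB_NATIVE:
        # Direct path: optimized delivery copy plus 150px grid thumbnail.
        return ["image.webp", "thumb.webp"]
    if fmt in COMPLEX:
        # Flatten via ImageMagick/LibRaw first, then the standard pipeline.
        return ["proxy.jpg", "image.webp", "thumb.webp"]
    raise ValueError(f"unsupported image format: {fmt}")
```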
- 📄 Document Handling (PDFs, DOCX, TXT, MD)
  - Word & Markdown to PDF: We utilize a headless LibreOffice container (`soffice --headless`) to faithfully convert documents into standard PDFs. For Markdown (`.md`), we first parse it into styled HTML with GitHub-flavored typography before converting it via LibreOffice.
  - Poster Extraction: The first page of the document is rendered into a high-resolution `poster.jpg` that serves as a visual cover image.
  - Text Extraction: The raw text within the document is scraped and securely buffered into memory.
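For reference, the LibreOffice invocation mentioned above looks roughly like the command built here. `soffice --headless --convert-to pdf --outdir` is LibreOffice's standard CLI; the container wiring and paths around it are illustrative.

```python
def soffice_cmd(input_path: str, out_dir: str) -> list[str]:
    """Build the headless LibreOffice PDF-conversion command.

    Mirrors LibreOffice's documented CLI; paths are placeholders.
    """
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", out_dir,
        input_path,
    ]

cmd = soffice_cmd("/tmp/report.docx", "/tmp/out")
```

In the real container this list would be handed to a process runner (e.g. `subprocess.run(cmd, check=True)`), with Markdown pre-rendered to HTML before it reaches this step.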
- 🎥 Video & Audio Handling
  - HLS Streaming: If the `adaptive_stream` flag is set, the video triggers a dedicated AWS Elemental MediaConvert workflow, transcoding massive `mp4`/`mov` files into a ladder of `.m3u8` renditions (1080p, 720p, 480p) for seamless client buffering.
  - Cover Image: A snapshot from the video timeline is extracted and saved as the asset's visual `poster.jpg`.
  - Audio Extraction: The internal audio track is extracted as a temporary MP3 file for speech processing.
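The rendition ladder and the master playlist that ties it together can be sketched as follows. The 1080p/720p/480p tiers come from the text above; the bitrates and per-rendition playlist names are placeholder assumptions, and the real ladder is defined inside the MediaConvert job, not in application code.

```python
# Assumed HLS ladder; resolutions are from the document, bitrates are
# placeholders.
HLS_LADDER = [
    {"name": "1080p", "width": 1920, "height": 1080, "bitrate_kbps": 5000},
    {"name": "720p",  "width": 1280, "height": 720,  "bitrate_kbps": 2800},
    {"name": "480p",  "width": 854,  "height": 480,  "bitrate_kbps": 1200},
]

def master_playlist(ladder) -> str:
    """Render a minimal HLS master .m3u8 referencing each rendition."""
    lines = ["#EXTM3U"]
    for r in ladder:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r['bitrate_kbps'] * 1000},"
            f"RESOLUTION={r['width']}x{r['height']}"
        )
        lines.append(f"{r['name']}.m3u8")
    return "\n".join(lines)
```

The client player reads this master playlist and switches between rendition playlists as bandwidth changes, which is the "seamless buffering" behavior described above.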
Phase 1C: Artificial Intelligence
With clean derivatives generated, we invoke our multimodal AI stack to construct the search-engine index:
- Vision Analysis (Rekognition & Claude): Assets possessing a physical image derivative (photos, document posters, video posters) are pushed to AWS Rekognition for bounding-box facial recognition (mapping faces into specific Collections) and content moderation. Simultaneously, we pass the proxy to Anthropic's Claude 3 Haiku (via Amazon Bedrock) to visually read and summarize the content.
- Transcript & Document AI (Sonnet): Audio is fed into Amazon Transcribe for full textual transcripts. The transcript (or extracted document text) is forwarded to Claude Sonnet via Bedrock to cleanly summarize massive multi-page payloads into actionable paragraphs.
- The Unified Multimodal Embedding (Titan): The "holy grail" of our ingest architecture. We fuse the proxy image and the generated text summary/transcript using the `amazon.titan-embed-image-v1` multimodal model, producing a 1024-dimensional vector that represents both visual pixels and textual context in the same mathematical space.
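A request to the Titan multimodal model pairs the base64-encoded proxy image with the text summary in one body. The sketch below follows the Bedrock Titan Multimodal Embeddings request shape (`inputImage`, `inputText`, `embeddingConfig.outputEmbeddingLength`) as I understand it; verify the field names against the current Bedrock documentation before relying on them.

```python
import base64
import json

def titan_embedding_request(proxy_jpeg: bytes, summary: str) -> str:
    """Build the amazon.titan-embed-image-v1 request body.

    Fuses image pixels and text context into one embedding request;
    field names assumed from the Bedrock Titan schema.
    """
    return json.dumps({
        "inputImage": base64.b64encode(proxy_jpeg).decode("ascii"),
        "inputText": summary,
        "embeddingConfig": {"outputEmbeddingLength": 1024},
    })
```

The resulting body would be passed to Bedrock's `InvokeModel` API, and the response's embedding field is the 1024-dimensional vector persisted in Phase 1D.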
Phase 1D: Persistence
The final dimensions, EXIF data, GPS coordinates, textual transcripts, and multimodal vectors are written in parallel to Neon (PostgreSQL) for metadata management and to AWS OpenSearch for low-latency conversational vector search indexing.
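An OpenSearch index capable of holding this record might be mapped as below. The `knn_vector` field type and 1024 dimension match standard OpenSearch k-NN usage with the Titan vector; the index and field names are assumptions, not Picsha AI's actual schema.

```python
# Hypothetical OpenSearch index mapping for the persisted asset record.
# knn_vector is the standard OpenSearch k-NN field type; names assumed.
ASSET_INDEX = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "asset_id":   {"type": "keyword"},
            "summary":    {"type": "text"},
            "transcript": {"type": "text"},
            "gps":        {"type": "geo_point"},
            "embedding":  {"type": "knn_vector", "dimension": 1024},
        }
    },
}
```

With this mapping, a conversational query is itself embedded through the same Titan model and run as a k-NN search against `embedding`, while `summary` and `transcript` remain available for keyword matching.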
2. The Delivery Phase (CloudFront & Lambda)
Asset delivery is managed by the picsha-ai-image-delivery serverless service. It employs an AWS API Gateway + AWS Lambda architecture sitting behind a globally distributed Amazon CloudFront CDN (cdn.picsha.ai).
Dynamic Sourcing and Transformation
When a request reaches the Lambda function (e.g., /render/{assetId}?wd=500&ht=500&crop=true), the service performs the following steps:
- Source Determination: It intelligently decides which S3 object to pull. For standard images, it pulls the original. For complex images (HEIC/RAW), it dynamically swaps the source to the `assets/${id}/proxy.jpg` generated during ingestion. For documents and videos, it pulls the `poster.jpg`.
- On-the-Fly Processing: Using the `sharp` library, the Lambda applies all requested query parameters. This includes resizing, "face gravity" cropping, blurring, watermarking, text compositing, and even AI-driven generative editing (e.g., background removal) via the MIMI orchestrator and AWS Bedrock.
- Response: The final transformed image is returned synchronously to the client as a Base64-encoded binary response (or as a 302 redirect to S3 if the original file is requested without transformations).
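The source-determination step reduces to a small routing function. The `proxy.jpg` and `poster.jpg` key suffixes come from the text above; the `original` key for web-native images and the `kind` values are hypothetical.

```python
def source_key(asset_id: str, kind: str) -> str:
    """Pick the S3 key to read for a render request.

    Key layout partly assumed; only proxy.jpg/poster.jpg are documented.
    """
    if kind in ("jpeg", "png", "webp"):
        # Web-native images: serve from the untouched original.
        return f"assets/{asset_id}/original"
    if kind in ("heic", "raw", "psd", "eps"):
        # Complex images: swap to the flattened ingest-time proxy.
        return f"assets/{asset_id}/proxy.jpg"
    if kind in ("document", "video"):
        # Documents and videos: transformations apply to the cover image.
        return f"assets/{asset_id}/poster.jpg"
    raise ValueError(f"unknown asset kind: {kind}")
```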
Sampling in the CDN (Caching Strategy)
To ensure the Lambda function is not overwhelmed and to keep response times consistently low, we rely heavily on CloudFront edge caching.
Where and How We Sample:
- CloudFront is configured to forward query strings (`ForwardedValues: QueryString: true`).
- In the context of the CDN, sampling refers to caching each unique permutation of the image transformations. The CloudFront cache key is the exact combination of the URL path AND the specific query parameters requested.
- Example: a request for `?wd=800` is sampled and cached separately from a request for `?wd=800&blur=10`.
- TTL: Every unique sample generated by the Lambda is cached at CloudFront edge locations for 86,400 seconds (1 day) (`DefaultTTL: 86400`) and marked as `immutable` in the `Cache-Control` headers.
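The path-plus-query cache key described above can be modeled directly. One caveat shown in the sketch: sorting the parameters is an illustrative normalization, since stock CloudFront would otherwise treat `?a=1&b=2` and `?b=2&a=1` as two distinct cache entries unless a cache policy normalizes them.

```python
from urllib.parse import parse_qsl, urlencode

def cache_key(path: str, query: str) -> str:
    """Canonical edge cache key: URL path plus sorted query parameters.

    Sorting is an assumed normalization step, not default CloudFront
    behavior.
    """
    params = sorted(parse_qsl(query))
    return f"{path}?{urlencode(params)}" if params else path

k1 = cache_key("/render/abc123", "wd=800")
k2 = cache_key("/render/abc123", "wd=800&blur=10")
# k1 and k2 differ, so each transformation permutation is cached
# (sampled) independently at the edge.
```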
This architecture guarantees that the expensive compute required for complex transformations only happens once per unique parameter combination, after which it is served directly from the CDN edge node closest to the user.