Picsha AI
Developer Documentation

Ingestion Pipeline

The moment an asset hits our S3 storage ingress, Picsha AI dispatches an asynchronous task to the queue worker (picsha-ai-ingest). This isolated node performs the heavy lifting: identifying magic bytes, rendering proxies, extracting text, and generating multimodal AI embeddings.

Because we process diverse asset classes (flat images, 40MP RAW files, multi-page PDFs, and MP4 videos), the pipeline branches dynamically based on the detected MIME type.
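A minimal sketch of the dispatch step, assuming an S3 "ObjectCreated" event handler that enqueues work for picsha-ai-ingest over SQS. The queue URL and message shape are illustrative, not the production contract:

```python
# Illustrative ingress dispatcher: S3 event -> SQS message for the ingest worker.
import json
import boto3

sqs = boto3.client("sqs")
INGEST_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/picsha-ai-ingest"  # hypothetical

def handler(event, context):
    # One S3 "ObjectCreated" record per uploaded asset.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sqs.send_message(
            QueueUrl=INGEST_QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )
```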

Phase 1: Security & Detection

1. Magic Byte Analysis. We never trust file extensions or the generic application/octet-stream MIME type supplied by web clients. As soon as an asset is ingested, we use ExifTool to inspect the literal file headers (magic bytes) and pin down its true format.
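A minimal sketch of that check, assuming the worker shells out to the ExifTool CLI:

```python
# Read the MIME type ExifTool derives from the file's actual headers,
# ignoring whatever extension the client uploaded.
import subprocess

def detect_mime(path: str) -> str:
    out = subprocess.run(
        ["exiftool", "-s3", "-MIMEType", path],  # -s3 prints only the value
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()  # e.g. "image/heic" even if the file is named .jpg
```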

2. Generative Object Removal (Optional). If the ingress API call passed generative_edit.remove parameters, the system immediately runs an Amazon Titan object-removal pass. This physically replaces the source asset on S3 with the cleaned, generated version before any downstream indexing.
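A hedged sketch of this pass, assuming Titan Image Generator inpainting on Amazon Bedrock; the model ID and payload follow the public Bedrock request shape, and the mapping from generative_edit.remove to a mask prompt is illustrative:

```python
# Remove an object by inpainting: supply the source image plus a mask prompt,
# and Titan regenerates the masked region without the object.
import base64, json
import boto3

bedrock = boto3.client("bedrock-runtime")

def remove_object(image_bytes: bytes, mask_prompt: str) -> bytes:
    body = {
        "taskType": "INPAINTING",
        "inPaintingParams": {
            "image": base64.b64encode(image_bytes).decode(),
            "maskPrompt": mask_prompt,  # e.g. "the person on the left"
        },
        "imageGenerationConfig": {"numberOfImages": 1},
    }
    resp = bedrock.invoke_model(
        modelId="amazon.titan-image-generator-v1",
        body=json.dumps(body),
    )
    out = json.loads(resp["body"].read())
    return base64.b64decode(out["images"][0])  # written back over the source object on S3 upstream
```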

Phase 2: Derivative Generation

Each media type requires its own processing steps to prepare it for ultra-fast web delivery.

📸 Standard & Complex Image Handling

  • Web Images (JPEG/PNG): We generate an optimized .webp web delivery version and a lightweight 150px grid thumbnail.
  • Complex Images (HEIC, RAW, PSD, EPS, AI): Browsers cannot render these formats. We first pass the file through an ImageMagick/LibRaw in-memory pipeline to extract the primary layer and flatten it into a universally readable, high-quality JPEG proxy. That proxy is then routed into the standard optimization pipeline to generate the .webp variants (a sketch follows this list).
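A rough shape of the derivative step, assuming the worker shells out to ImageMagick (with LibRaw delegates for RAW); output filenames are illustrative:

```python
import subprocess

def make_derivatives(src: str) -> None:
    # Flatten the primary layer of a HEIC/RAW/PSD into a universally readable JPEG proxy.
    # "[0]" selects the first layer/frame; -flatten merges layered formats onto one canvas.
    subprocess.run(["magick", f"{src}[0]", "-flatten", "-quality", "92", "proxy.jpg"], check=True)
    # Optimized web delivery version.
    subprocess.run(["magick", "proxy.jpg", "-quality", "80", "web.webp"], check=True)
    # 150px grid thumbnail (fits within 150x150, preserving aspect ratio).
    subprocess.run(["magick", "proxy.jpg", "-thumbnail", "150x150", "thumb.webp"], check=True)
```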

📄 Document Handling (PDFs, DOCX, TXT)

  • PDF Conversion: If a .docx or plain-text file is ingested, we immediately convert it into a standard web.pdf to ensure uniform cross-device viewer compatibility.
  • Poster Extraction: The first page of the document is rendered into a high-res poster.jpg, which serves as a cover image for our grid views.
  • Text Extraction: The raw text within the document is extracted and buffered in memory for the AI phase (see the sketch after this list).
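The documentation does not name the converters, so the sketch below assumes common tooling: LibreOffice in headless mode for DOCX-to-PDF, pdftoppm for the poster, and pdftotext for extraction:

```python
import subprocess

def process_document(src: str) -> str:
    # Normalize DOCX / plain text into a single web-friendly PDF.
    subprocess.run(["libreoffice", "--headless", "--convert-to", "pdf", src], check=True)
    pdf = src.rsplit(".", 1)[0] + ".pdf"
    # Render page 1 as a high-res cover image for grid views (writes poster.jpg).
    subprocess.run(["pdftoppm", "-jpeg", "-r", "150", "-f", "1", "-l", "1",
                    "-singlefile", pdf, "poster"], check=True)
    # Scrape the raw text and buffer it in memory for the AI phase.
    text = subprocess.run(["pdftotext", pdf, "-"], capture_output=True,
                          text=True, check=True).stdout
    return text
```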

🎥 Video & Audio Handling

  • HLS Streaming: If the adaptive_stream flag is set, the video triggers a dedicated AWS Elemental MediaConvert workflow. This transcodes the large mp4/mov into HLS renditions (1080p, 720p, 480p) with .m3u8 playlists, allowing the client to switch quality and buffer seamlessly.
  • Cover Image: A snapshot from the video timeline is extracted and saved as the asset's visual poster.jpg.
  • Audio Extraction: If the asset is a video, the internal audio track is extracted and uploaded as a temporary MP3 file for speech processing (see the sketch after this list).
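A minimal ffmpeg sketch for the poster frame and the temporary MP3; the HLS renditions themselves come from the MediaConvert job and are not shown here. The tool choice and output names are assumptions for illustration:

```python
import subprocess

def extract_video_derivatives(src: str) -> None:
    # Grab a single frame (~1s in) as the asset's visual cover.
    subprocess.run(["ffmpeg", "-y", "-ss", "1", "-i", src,
                    "-frames:v", "1", "poster.jpg"], check=True)
    # Strip the audio track to MP3 for speech processing.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn",
                    "-acodec", "libmp3lame", "audio.mp3"], check=True)
```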

Phase 3: Artificial Intelligence

With the clean derivatives generated, we invoke our multimodal AI stack to build the search index.

1. Vision Analysis & Formatting (Rekognition & Claude). Any asset with a physical image derivative (photos, document posters, video posters) is pushed to Amazon Rekognition, which runs bounding-box inference for face detection and content moderation. In parallel, we pass the proxy to Anthropic's Claude 3 Haiku (via Amazon Bedrock) to read and summarize the visual context natively.
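A hedged sketch of that vision pass: Rekognition for labels, faces, and moderation, plus a Claude 3 Haiku call on Bedrock for a natural-language description of the proxy image. Limits, prompts, and the returned dictionary shape are illustrative:

```python
import base64, json
import boto3

rekognition = boto3.client("rekognition")
bedrock = boto3.client("bedrock-runtime")

def analyze_image(bucket: str, key: str, proxy_bytes: bytes) -> dict:
    s3_obj = {"S3Object": {"Bucket": bucket, "Name": key}}
    labels = rekognition.detect_labels(Image=s3_obj, MaxLabels=25)
    faces = rekognition.detect_faces(Image=s3_obj, Attributes=["DEFAULT"])
    moderation = rekognition.detect_moderation_labels(Image=s3_obj)

    # Ask Claude 3 Haiku (Bedrock messages API) to summarize the proxy image.
    prompt = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg",
                                             "data": base64.b64encode(proxy_bytes).decode()}},
                {"type": "text", "text": "Describe and summarize this image for search."},
            ],
        }],
    }
    resp = bedrock.invoke_model(modelId="anthropic.claude-3-haiku-20240307-v1:0",
                                body=json.dumps(prompt))
    summary = json.loads(resp["body"].read())["content"][0]["text"]
    return {"labels": labels, "faces": faces, "moderation": moderation, "summary": summary}
```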

2. Transcript & Document AI (Sonnet)

  • Videos/Audio: The extracted MP3 audio is fed into Amazon Transcribe, which outputs a complete textual transcript of the spoken content.
  • Documents & Transcripts: The resulting transcript (or extracted document text) is then forwarded to Claude 4.6 Sonnet to summarize large multi-page payloads into concise, actionable paragraphs (see the sketch after this list).
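A sketch of this step, assuming Amazon Transcribe for speech-to-text and a Claude Sonnet model on Bedrock for summarization. The job name, Sonnet model ID, and prompt below are illustrative; substitute the Sonnet version actually deployed:

```python
import json
import boto3

transcribe = boto3.client("transcribe")
bedrock = boto3.client("bedrock-runtime")
SONNET_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # illustrative placeholder

def start_transcript(asset_id: str, audio_s3_uri: str) -> None:
    # Kick off an asynchronous transcription job against the temporary MP3 on S3.
    transcribe.start_transcription_job(
        TranscriptionJobName=f"picsha-{asset_id}",
        Media={"MediaFileUri": audio_s3_uri},
        MediaFormat="mp3",
        IdentifyLanguage=True,
    )

def summarize(text: str) -> str:
    # Compress a long transcript or document into a few actionable paragraphs.
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user",
                      "content": f"Summarize the following document into a few actionable paragraphs:\n\n{text}"}],
    }
    resp = bedrock.invoke_model(modelId=SONNET_MODEL_ID, body=json.dumps(body))
    return json.loads(resp["body"].read())["content"][0]["text"]
```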

3. The Unified Multimodal Embedding (Titan). Finally, the "holy grail" of our ingest architecture occurs. We take the proxy image plus the generated text summary/transcript and fuse them with Amazon Titan Multimodal Embeddings. This produces a single 1024-dimensional vector that represents both the visual content and the textual context in the same mathematical space.
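A hedged sketch of the fused embedding call, following the Titan Multimodal Embeddings request shape on Bedrock (1024 is the default output length); the text-length guard is an assumption:

```python
import base64, json
import boto3

bedrock = boto3.client("bedrock-runtime")

def embed(proxy_bytes: bytes, text_summary: str) -> list[float]:
    # Image and text are embedded together, so visual and textual context
    # land in the same vector space.
    body = {
        "inputImage": base64.b64encode(proxy_bytes).decode(),
        "inputText": text_summary[:10000],  # illustrative guard against oversized payloads
        "embeddingConfig": {"outputEmbeddingLength": 1024},
    }
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-image-v1",
                                body=json.dumps(body))
    return json.loads(resp["body"].read())["embedding"]  # 1024-dimensional vector
```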

Phase 4: Persistence

The final dimensions, EXIF data, GPS coordinates, textual transcripts, and multimodal vectors are pushed in parallel to Neon (PostgreSQL) for user-facing metadata management and to Amazon OpenSearch for fast conversational search indexing.
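A persistence sketch, with metadata landing in Postgres and the transcript plus vector landing in OpenSearch. Connection strings, table/index names, and column layout are assumptions for illustration:

```python
import json
import psycopg2
from opensearchpy import OpenSearch

pg = psycopg2.connect("postgresql://user:pass@your-neon-host/picsha")  # hypothetical DSN
os_client = OpenSearch(hosts=[{"host": "your-opensearch-domain", "port": 443}], use_ssl=True)

def persist(asset_id: str, meta: dict, transcript: str, vector: list[float]) -> None:
    # User-facing metadata goes to Neon (PostgreSQL).
    with pg, pg.cursor() as cur:
        cur.execute(
            "UPDATE assets SET width=%s, height=%s, exif=%s, gps=%s WHERE id=%s",
            (meta["width"], meta["height"], json.dumps(meta["exif"]),
             json.dumps(meta.get("gps")), asset_id),
        )
    # Search payload goes to OpenSearch; the embedding is stored as a knn_vector field.
    os_client.index(index="picsha-assets", id=asset_id, body={
        "transcript": transcript,
        "embedding": vector,  # 1024-dimensional multimodal vector
    })
```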

If cache warming is requested via the API, the worker dispatches HTTP requests for the new derivatives back through our CDN endpoints so that the first client request hits a warm cache and returns with sub-50ms latency.
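A trivial sketch of that warming step; the CDN base URL and derivative keys are illustrative:

```python
import requests

def warm_cache(cdn_base: str, keys: list[str]) -> None:
    # Request each freshly generated derivative through the CDN so edge caches are populated.
    for key in keys:  # e.g. ["web.webp", "thumb.webp", "poster.jpg"]
        requests.get(f"{cdn_base}/{key}", timeout=10)
```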