We Build Pipelines That Run Themselves
Autonomous data collection, real-time processing, multi-stage workflows, and AI-driven generation pipelines. Hundreds of scheduled jobs across production systems, running 24/7 without manual intervention.
Pipelines We've Built
Four real systems, four different pipeline architectures — all in production
Traffic Intelligence Pipeline
170+ automated jobs / real-time + batch processing / multi-region AWS
Real-time traffic data flows in from hardware sensors across multiple vendors — Bluetooth detectors, radar units, video analytics systems. Each vendor has a different API, different data format, different polling interval. We normalize all of it into a unified schema and run it through travel-time calculation, segment aggregation, and anomaly detection before it hits the database.
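As a rough illustration of the normalization step, the sketch below maps two vendor payloads into one record type. The field names, payload shapes, and unit choices are assumptions made for the example, not the production schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Reading:
    """Unified record shared by every vendor adapter (hypothetical fields)."""
    sensor_id: str
    vendor: str
    observed_at: datetime  # always timezone-aware UTC
    speed_kph: float


def from_bluetooth(raw: dict) -> Reading:
    # Assumed payload: epoch seconds and speed in miles per hour.
    return Reading(
        sensor_id=str(raw["detector"]),
        vendor="bluetooth",
        observed_at=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        speed_kph=raw["mph"] * 1.609344,
    )


def from_radar(raw: dict) -> Reading:
    # Assumed payload: ISO 8601 timestamp with offset, speed in metres per second.
    return Reading(
        sensor_id=str(raw["unit_id"]),
        vendor="radar",
        observed_at=datetime.fromisoformat(raw["time"]).astimezone(timezone.utc),
        speed_kph=raw["speed_ms"] * 3.6,
    )


# One adapter per vendor; adding a vendor means adding one function here.
ADAPTERS = {"bluetooth": from_bluetooth, "radar": from_radar}


def normalize(vendor: str, raw: dict) -> Reading:
    """Dispatch a raw payload to its vendor adapter."""
    return ADAPTERS[vendor](raw)
```

Keeping each vendor's quirks inside its own adapter means the downstream travel-time and anomaly stages only ever see `Reading` objects.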
On top of the real-time layer, batch jobs run hourly and daily aggregations: FHWA-compliant travel time indices (Travel Time Index (TTI), Planning Time Index (PTI), and Buffer Time Index (BTI)), near-miss safety metrics (time-to-collision and post-encroachment time, TTC/PET), and EPA MOVES emissions estimates. Archive jobs compress and store raw data. Monitoring jobs watch the whole thing and alert when collectors go silent.
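For readers unfamiliar with the travel time indices, here is a minimal sketch using the commonly cited definitions: TTI is mean travel time over free-flow travel time, PTI is 95th-percentile travel time over free-flow travel time, and BTI is the gap between the 95th percentile and the mean, relative to the mean. The nearest-rank percentile and the function names are assumptions for the example.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile (one of several common conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def reliability_indices(travel_times_s, free_flow_s):
    """Compute TTI, PTI, and BTI for one segment and time window."""
    avg = mean(travel_times_s)
    p95 = percentile(travel_times_s, 95)
    return {
        "tti": avg / free_flow_s,   # typical delay relative to free flow
        "pti": p95 / free_flow_s,   # near-worst-case trip relative to free flow
        "bti": (p95 - avg) / avg,   # extra buffer a traveler should budget
    }
```

For example, travel times of 100, 110, 120, 130, and 140 seconds on a segment with a 100-second free-flow time give a TTI of 1.2 and a PTI of 1.4.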
Intelligence Collection Pipeline
167 automated collectors / knowledge graph ingestion / LLM inference chains
A distributed intelligence system that autonomously collects, processes, and connects information across domains. 167 collectors run on scheduled cycles — scraping, polling APIs, monitoring feeds, and ingesting structured data. Each collector normalizes its output into a common signal format tagged with source, confidence, and domain metadata.
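A common signal format of this kind could look like the following sketch. The class name, field names, and the 0 to 1 confidence range are assumptions for illustration only.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class Signal:
    """One normalized collector output (hypothetical shape)."""
    source: str        # where the signal came from, e.g. a feed or API name
    domain: str        # subject area used for routing and correlation
    confidence: float  # assumed scale: 0.0 (unverified) to 1.0 (confirmed)
    payload: dict      # collector-specific content
    collected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self):
        # Reject malformed signals at the boundary, before enrichment runs.
        if not self.source or not self.domain:
            raise ValueError("source and domain are required")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
```

Calling `asdict(signal)` yields a plain dictionary that is easy to serialize onto a queue or into a staging table.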
Incoming signals feed into a multi-stage enrichment pipeline: entity extraction, relationship mapping into a Neo4j knowledge graph, vector embedding into Qdrant for semantic search, and time-series storage in TimescaleDB. LLM inference runs on local GPU hardware, producing summaries, assessments, and cross-domain correlation reports on 4-hour autonomous cycles.
Decision Engine Pipeline
Multi-source aggregation / scheduled analysis / LLM-augmented recommendations
A personal command center that aggregates data from multiple life domains — journal entries, habit tracking, financial transactions, goals, project status, and external feeds. Incoming data flows through validation and normalization before landing in domain-specific stores.
Scheduled analysis jobs run pattern detection across domains: correlating habits with goal progress, tracking financial trends against budgets, and surfacing tasks that are falling behind. An LLM layer generates weekly summaries and actionable recommendations based on the cross-domain analysis. Everything runs on self-hosted infrastructure with zero external data custody.
Autonomous Code Generation Pipeline
16-state workflow / 3 async queues / proposal to tested branch
An AI-driven development pipeline that accepts build proposals and runs them through a 16-state process to produce tested code. Proposals enter a planning queue where an LLM breaks them into implementation steps. Each step flows through a generation queue that writes code on isolated Git branches, then into a validation queue that runs tests and static analysis.
Three Celery queues handle the different stages with different concurrency and priority settings — planning is serial, generation is parallel across proposals, validation runs with dedicated resources. Failed validations loop back to generation with error context. Successful builds produce tested branches ready for human review. The whole pipeline runs autonomously once a proposal is submitted.
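One way to express a three-queue split like this is a Celery configuration module such as the sketch below. The setting names are Celery's own; the task names, broker URL, and concurrency numbers are hypothetical.

```python
# celeryconfig.py (sketch): module-level names are Celery's lowercase settings.
broker_url = "redis://localhost:6379/0"  # assumed local Redis broker

# Route each stage's tasks to its own queue. Task names are hypothetical.
task_routes = {
    "pipeline.tasks.plan_proposal": {"queue": "planning"},
    "pipeline.tasks.generate_step": {"queue": "generation"},
    "pipeline.tasks.validate_step": {"queue": "validation"},
}

# Fetch one task at a time so a long job does not hold others in reserve,
# and acknowledge only after the task finishes so a crashed worker's task
# can be redelivered.
worker_prefetch_multiplier = 1
task_acks_late = True

# Concurrency is set per worker process rather than per queue, for example:
#   celery -A pipeline worker -Q planning   --concurrency=1
#   celery -A pipeline worker -Q generation --concurrency=4
#   celery -A pipeline worker -Q validation --concurrency=2
```

Running one worker pool per queue is what lets planning stay serial while generation fans out across proposals.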
How We Build Pipelines
Celery + Redis for Everything Async
Every pipeline we build runs on Celery with Redis as the broker. That covers scheduled tasks, event-driven triggers, priority queues, retry logic with exponential backoff, and dead-letter handling. We know this stack deeply and we push it hard.
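The retry and dead-letter mechanics can be sketched in plain Python as follows. Celery exposes comparable behavior through task options such as `autoretry_for` and `retry_backoff`; the function names, defaults, and in-memory dead-letter list here are assumptions for illustration.

```python
import random
import time


def backoff_delay(retries, base=1.0, cap=600.0, jitter=True):
    """Seconds to wait before retry number `retries` (0-based)."""
    delay = min(cap, base * 2 ** retries)
    # Full jitter spreads retries out so failed jobs do not stampede together.
    return random.uniform(0, delay) if jitter else delay


def run_with_retries(job, payload, max_retries=3, dead_letters=None,
                     sleep=time.sleep):
    """Run job(payload); once retries are exhausted, park the payload."""
    if dead_letters is None:
        dead_letters = []
    for attempt in range(max_retries + 1):
        try:
            return job(payload)
        except Exception as exc:
            if attempt == max_retries:
                # Keep the failed input and the reason for later inspection.
                dead_letters.append({"payload": payload, "error": repr(exc)})
                return None
            sleep(backoff_delay(attempt))
```

With the defaults and jitter disabled, the waits are 1, 2, and 4 seconds before the payload is dead-lettered on the fourth failure.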
Normalize Early, Enrich Later
Raw data hits a staging layer first. Validation, schema normalization, and deduplication happen before anything touches the main database. Enrichment — entity extraction, embedding, aggregation — runs as separate downstream jobs that can fail without losing the source data.
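A stripped-down version of this split might look like the sketch below: validation and deduplication happen in `stage`, and each enrichment step runs on its own so one failure cannot touch the staged records. The function names and required fields are hypothetical.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Stable content hash used as a deduplication key."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stage(records, required=("source", "value")):
    """Validate and deduplicate raw records; return (accepted, rejected)."""
    seen, accepted, rejected = set(), [], []
    for record in records:
        if any(key not in record for key in required):
            rejected.append(record)  # kept aside for inspection, not dropped
            continue
        key = fingerprint(record)
        if key in seen:
            continue  # exact duplicate of a record already accepted
        seen.add(key)
        accepted.append(record)
    return accepted, rejected


def enrich_all(accepted, enrichers):
    """Run each enricher independently; a failure is recorded, not propagated."""
    results, failures = {}, {}
    for name, enricher in enrichers.items():
        try:
            results[name] = [enricher(record) for record in accepted]
        except Exception as exc:
            failures[name] = repr(exc)
    return results, failures
```

Because enrichers only read the accepted records, a failed enrichment can be rerun later against the same staged data.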
Every Job Is Observable
Every pipeline job logs its start, completion, duration, and record count. Monitoring jobs watch for silent failures — collectors that stop collecting, jobs that run longer than expected, queues that back up. We know something is broken before users notice.
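The logging and silent-failure checks described here can be approximated with a decorator and a staleness query, as in this sketch. The logger name, the in-memory `last_success` registry, and the assumption that each job returns a list of records are all illustrative.

```python
import functools
import logging
import time

logger = logging.getLogger("pipeline.jobs")

# Job name -> wall-clock time of its last successful run (in-memory stand-in
# for whatever store a real monitoring job would query).
last_success = {}


def observed(job_name):
    """Log start, completion, duration, and record count for a job."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.monotonic()
            logger.info("job=%s event=start", job_name)
            try:
                records = func(*args, **kwargs)
            except Exception:
                logger.exception("job=%s event=failed", job_name)
                raise
            duration = time.monotonic() - started
            last_success[job_name] = time.time()
            logger.info("job=%s event=done duration=%.3fs records=%d",
                        job_name, duration, len(records))
            return records
        return wrapper
    return decorator


def silent_jobs(max_age_s, now=None):
    """Names of jobs whose last success is older than max_age_s seconds."""
    now = time.time() if now is None else now
    return sorted(name for name, ts in last_success.items()
                  if now - ts > max_age_s)
```

A job that has never succeeded will not appear in `last_success`, so a real monitor would also compare against the list of jobs it expects to see.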
Self-Hosted When It Matters
Pipelines that handle sensitive data or need GPU compute run on hardware we own and maintain. No third-party data custody, no variable cloud costs for inference workloads. AWS for what makes sense, bare metal for what doesn't.
Need a Pipeline Built?
Whether it's real-time data collection, batch ETL, AI-driven workflows, or something we haven't seen yet — we can build it.
Email
contact@legacycoder.com
Phone
+1 (720) 767-3986
Location
Denver, CO