Turning 25.9 million raw events into kilobyte queries

A medallion pipeline on AWS (Bronze, Silver, Gold) that cut query latency 16× and data scanned ~400,000×, built as a graduate capstone with the trade-offs made explicit.

This was my capstone for DATA 516, the scalable-data-systems course in the University of Washington's data science master's. The brief was open-ended: take a firehose of raw e-commerce events and turn it into something a data scientist can actually query, then defend every architectural choice. The numbers below come from a real run over 25.9 million synthetic events, and the scope is at the end.

The problem

An e-commerce platform emits 500,000 to 750,000 events every five minutes (page views, cart actions, purchases, searches) that land in S3 as compressed JSONL. The raw data is reliable. Querying it is the problem. A simple question like "what is the conversion funnel by product?" means scanning hundreds of megabytes of nested logs, every time, and waiting. The five questions the business kept asking (conversion funnels, hourly revenue, top products, category performance, daily active users) each meant a slow, expensive full scan. The job was to make those answers cheap and fast without throwing away the raw record.

Bronze, Silver, Gold

I built it as a medallion architecture: three layers, each with one responsibility.

Bronze is the immutable landing zone, the raw JSONL exactly as it arrived, Hive-partitioned down to the minute. It is never edited. If a transformation turns out to be wrong, I can always reprocess from here.

Silver is the cleaned, typed layer. A single PySpark job on AWS Glue reads Bronze through the Glue Data Catalog, parses the timestamps into real datetimes, computes revenue on purchase events, drops records missing critical fields, and writes Parquet with Snappy compression, partitioned by date. Columnar Parquet is the first real lever, because Athena can prune columns and push predicates down instead of parsing whole JSON objects.

Gold is five pre-aggregated tables, one per business question: product funnel, hourly revenue, product popularity, category-by-day, and daily user activity. The expensive GROUP BYs run once, in the ETL, not on every query. A dashboard question that used to scan 420 MB now reads a few kilobytes.

The whole Bronze-to-Silver-to-Gold transform is one PySpark script that commits atomically. Either all three layers advance or the job fails and nothing is left half-written.

What the layering bought

I measured Bronze-direct queries against the Gold tables on the same data:

Query latency: 16.6× faster on average, 10.9s down to 0.66s.
Data scanned: roughly 412,000× less, about 420 MB down to 2 KB per query. At Athena's $5-per-terabyte rate, that is the difference between a few tenths of a cent per query and effectively free.
Storage: 14,779× smaller at the Gold layer, where 440 MB of Bronze becomes 29 KB of aggregates, because the questions only need the summary, not every row.

The trade is worth naming plainly. I moved cost from query time to ETL time, and took on the complexity of maintaining five aggregate tables. For an analytics workload where the same handful of questions get asked constantly, that is the right side of the trade. For ad-hoc exploration across the full event stream you would still go to Bronze, since the layering sits in front of the raw data rather than replacing it.

The two problems that actually cost me time

The architecture is textbook. The bugs were not.

The first was incremental processing. Glue job bookmarks track which files have already been handled so a re-run only touches new data. The catch is that when bookmarks find nothing new, the job receives an empty DataFrame and the first transformation throws on a column that "does not exist," so the pipeline fails on a no-op. The fix was to detect the empty case at the top and short-circuit to a clean job.commit(), making "no new data" a success rather than a crash. After that the incremental path behaved: a run with new data processed only the delta, and a run with nothing new exited in under a minute instead of reprocessing history.

The second was timestamps. The events carry full ISO-8601 strings with timezone offsets (2025-12-07T03:55:56.633003+00:00), and Athena's plain CAST(... AS TIMESTAMP) cannot parse them. I had to use FROM_ISO8601_TIMESTAMP in the Bronze-facing SQL and PySpark's to_timestamp in the Silver transform. A small thing that quietly produced nulls until I noticed, which is exactly the kind of data-quality bug that does not announce itself.

Reproducible by design

The entire stack (the S3 bucket with lifecycle rules, the Lambda event generator, the Glue job and crawlers, the Athena workgroup) is one CloudFormation template. Tearing it down and standing it back up is a single command, which mattered for working inside a student AWS account where idle resources cost real money. The generator's schedule ships disabled by default for the same reason, so nothing runs up a bill unless I turn it on.

Scope

The architecture is production-grade; the deployment is a capstone. The data is synthetic, generated by a Lambda, with no real storefront behind it. It ran inside a restricted student role, so a few steps a production version would automate are manual here, and I never load-tested it under real concurrent traffic. To take it further I would add data-quality monitoring beyond null checks, alerting and retries on the ETL, cost-budget alarms, and a real orchestration layer in place of manual triggers. The pattern itself (immutable raw, typed middle, pre-aggregated serving, all defined as code) is what I would carry into a production system unchanged.