Data Pipeline
Systems that handle external data reduce to similar steps everywhere. Fetch the data, clean it, store it, and expose it.
1. ETL and ELT
| Acronym | Expansion | Flow |
|---|---|---|
| ETL | Extract → Transform → Load | Transform before loading into the data warehouse |
| ELT | Extract → Load → Transform | Load first, transform inside the warehouse |
ETL emerged in the 1970s and 80s along with the rise of data warehouses. Storage was expensive at the time, so cleansing before load made sense. After 2010, with the lower storage and compute costs of cloud warehouses (Snowflake, BigQuery, Redshift), the ELT pattern spread widely. dbt (data build tool, 2016) made the transformation step structured in SQL and pushed ELT into the mainstream.
Selection criteria:
- Whether raw data is worth preserving as is (favors ELT).
- Whether transformations are expressible in SQL (ELT).
- Whether PII must be masked before load (ETL).
- Whether the warehouse's compute cost can absorb transformation cost (ELT).
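The contrast can be sketched in a few lines of Python. This is a toy illustration of control flow only, not a real warehouse client; all names are illustrative.

```python
# ETL vs ELT control flow in miniature. The "warehouse" is just a dict.

def etl(rows):
    """ETL: transform in application code, then load only the clean result."""
    cleaned = [r for r in rows if r.get("email")]   # transform before load
    return {"warehouse": cleaned}                   # raw rows are gone

def elt(rows):
    """ELT: load raw rows untouched; transform later inside the warehouse."""
    warehouse = {"raw": list(rows)}                 # load first, as-is
    warehouse["clean"] = [r for r in warehouse["raw"] if r.get("email")]
    return warehouse

rows = [{"email": "a@example.com"}, {"email": ""}]
print(len(etl(rows)["warehouse"]))   # 1 — only transformed rows survive
print(len(elt(rows)["raw"]))         # 2 — raw is preserved for reprocessing
```

The practical difference shows in the last two lines: with ELT the raw rows remain available when the transformation logic changes later.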
2. Generalization into four steps
External data flows split broadly into four steps.
① Extract: APIs, files, crawling, CDC
② Transform: format conversion, missing/outlier values, normalization, dedup
③ Load: operational DB, data lake, warehouse
④ Serve: APIs, search, dashboards, recommendations
Validation, observability, and retry slot in between each step. Failure modes differ per step, so it is safer to split responsibilities rather than let them pile up in one place.
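The four steps above can be sketched as plain functions. Each one is a stand-in for a real component (API client, cleaner, DB writer, API server); the data and names are illustrative.

```python
# Toy end-to-end flow for the four steps.

def extract():                      # ① e.g. API call, file read, CDC
    return [{"id": 1, "value": " 42 "}, {"id": 1, "value": "bad"}]

def transform(rows):                # ② type conversion, drop bad rows, dedup
    seen, clean = set(), []
    for r in rows:
        try:
            v = int(r["value"].strip())
        except ValueError:
            continue                # unparseable row: drop (outlier handling)
        if r["id"] not in seen:     # dedup on id
            seen.add(r["id"])
            clean.append({"id": r["id"], "value": v})
    return clean

def load(rows, store):              # ③ write to DB / lake / warehouse
    store.extend(rows)

def serve(store, id_):              # ④ expose via API / dashboard lookup
    return next((r for r in store if r["id"] == id_), None)

store = []
load(transform(extract()), store)
print(serve(store, 1))              # {'id': 1, 'value': 42}
```

Keeping the steps as separate functions is exactly what makes it possible to put validation and retries between them later.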
3. Direct call vs queue-based
Direct call (synchronous) — extract, transform, and load run sequentially in one process. Simple and easy to debug. The limit is that delay in one step blocks the whole, and the system is bound by external API behavior.
Queue-based (asynchronous) — a queue (Kafka, SQS, RabbitMQ, Redis Streams) sits between each step.
[Extractor] → topic.raw → [Transformer] → topic.clean → [Loader]
Advantages:
- Independent deployment and scaling per step.
- Reprocessing (replay) is possible — read the topic again.
- Absorbs traffic spikes.
Limits:
- Operational complexity rises.
- Assumptions about message order, duplication, and delay must be explicit.
The choice follows data volume, real-time requirement, and team size.
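The decoupling and replay properties can be shown with an in-memory stand-in. Real systems would use Kafka/SQS and consumer offsets; here topics are modeled as append-only lists so "replay" is simply re-reading from offset 0.

```python
# Topic-based decoupling in miniature: [Extractor] → topic.raw →
# [Transformer] → topic.clean → [Loader].

topic_raw, topic_clean = [], []

def extractor():
    for rec in ({"id": 1, "v": "10"}, {"id": 2, "v": "x"}):
        topic_raw.append(rec)               # produce to topic.raw

def transformer(from_offset=0):
    for rec in topic_raw[from_offset:]:     # consume topic.raw
        try:
            topic_clean.append({"id": rec["id"], "v": int(rec["v"])})
        except ValueError:
            pass                            # in practice: route to a dead-letter topic

def loader():
    return {r["id"]: r["v"] for r in topic_clean}   # consume topic.clean

extractor()
transformer()
print(loader())            # {1: 10}

# Replay: topic.raw still holds everything, so reprocessing with fixed
# transformation logic is always possible by re-reading from offset 0.
transformer(from_offset=0)
```

Each stage touches only its input topic, which is what allows independent deployment and scaling in the real version.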
4. Idempotency and delivery guarantees
In distributed systems, message delivery has three meanings.
| Guarantee | Meaning |
|---|---|
| at-most-once | Once or zero times. Loss possible. |
| at-least-once | At least once. Duplicates possible. |
| exactly-once | Exactly once. High implementation cost; holds only under narrow assumptions. |
Most queues guarantee at-least-once by default. It follows that the consumer must be idempotent: processing the same message twice must yield the same result.
Common ways to gain idempotency:
- Use the natural key (the external system's unique ID) as the PK, then `INSERT ... ON CONFLICT DO NOTHING`.
- Assign an idempotency key to each message and block duplicates with a processing-record table.
- Use the upsert pattern, so applying the same message twice yields the same result.
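The natural-key approach can be demonstrated end to end with SQLite standing in for the real store (SQLite has supported `ON CONFLICT ... DO NOTHING` since 3.24):

```python
import sqlite3

# Natural-key PK + ON CONFLICT DO NOTHING: consuming the same message
# twice leaves the table unchanged, absorbing at-least-once redelivery.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (external_id TEXT PRIMARY KEY, payload TEXT)")

def consume(msg):
    db.execute(
        "INSERT INTO events (external_id, payload) VALUES (?, ?) "
        "ON CONFLICT(external_id) DO NOTHING",
        (msg["external_id"], msg["payload"]),
    )
    db.commit()

msg = {"external_id": "order-42", "payload": "paid"}
consume(msg)
consume(msg)   # redelivery: the duplicate is a no-op
print(db.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 1
```

The same shape works in PostgreSQL with `ON CONFLICT (external_id) DO NOTHING`, or as a full upsert with `DO UPDATE SET ...`.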
Kafka's transactional and idempotent producer options reduce duplication on the producer side (full exactly-once holds only within Kafka topics). More in kafka-when.
5. Orchestration tools
Tools that visualize multiple steps and tie scheduling, retries, and dependencies together.
Apache Airflow (2014, Airbnb) — the most widely used workflow tool. Define DAGs (directed acyclic graphs) in Python code. Strengths are rich operators (Hive, Spark, S3, BigQuery, etc.) and a large community. Limits include a heavy scheduler and awkward dynamic workflows.
Dagster (2018) — asset-centric model. The theme is "what data are we producing." By many accounts it offers a more organized structure for types, tests, and observability.
Prefect (2019) — Pythonic API and dynamic flow. Emphasizes a flow closer to code than Airflow. A cloud hosting option is also available.
Mage (2022) — treats data pipelines as notebooks and blocks. UI friendliness is a feature.
The choice depends on team size, existing infrastructure, real-time needs, and observability needs. Plenty of places need only cron plus a script.
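At their core, all of these tools formalize the same idea: tasks plus dependencies, executed in topological order. A toy scheduler makes that concrete (task names and structure are illustrative; this is not any tool's API):

```python
# A DAG of tasks is dependencies plus topological execution; the
# orchestrators above add scheduling, retries, and UI on top of this.
from graphlib import TopologicalSorter

def run_dag(deps, tasks):
    """deps: {task: {upstream tasks}}. Runs each task once, upstream first."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        tasks[name]()
    return order

log = []
tasks = {name: (lambda n=name: log.append(n))
         for name in ("extract", "transform", "load")}
order = run_dag({"transform": {"extract"}, "load": {"transform"}}, tasks)
print(order)   # ['extract', 'transform', 'load']
```

When the needs stop at "run these in order, retry on failure," cron plus something this small may genuinely be enough.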
6. Preserve raw, separate processed
Separate raw schemas from processed (clean/curated) schemas. Keep the raw without modification and produce processed results separately. When transformation logic changes, reprocess the raw to refresh results (reproducibility).
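The reproducibility payoff is easiest to see in code. A minimal sketch, with in-memory lists standing in for the raw and processed schemas:

```python
# Raw stays untouched; processed is always derivable from raw. When the
# transformation logic changes, rerun over raw and rebuild processed.

raw = [{"price": "1,000"}, {"price": "250"}]   # stored exactly as received

def build_processed(raw_rows, parse):
    return [{"price": parse(r["price"])} for r in raw_rows]

v1 = build_processed(raw, lambda s: int(s.replace(",", "")))
# Later the rules change (say, prices should be floats): just reprocess.
v2 = build_processed(raw, lambda s: float(s.replace(",", "")))
print(v1[0], v2[0])   # {'price': 1000} {'price': 1000.0}
```

If the raw rows had been transformed in place, the second version could only be produced by re-fetching from the source, which may no longer be possible.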
7. CDC, backfill, data quality
CDC (Change Data Capture) — extract changes from the operational DB (WAL, binlog) and feed them into other systems. Debezium comes up often. Lower load and latency than polling.
Backfill — applying new transformation logic to past data. In queue-based pipelines we replay topics from the beginning, or specify a start point in storage to reprocess.
Data quality — tools like great_expectations (2018), dbt tests, and Soda codify the validation step. Rules like "alert if row count is 30% lower than usual" are watched by the system, not by hand.
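The row-count rule mentioned above is simple enough to codify directly. A sketch, using the mean of recent runs as the baseline (one of several reasonable baseline choices):

```python
# "Alert if row count is 30% lower than usual," codified.

def row_count_alert(history, today, drop_threshold=0.3):
    """True if today's count fell more than drop_threshold below the
    average of previous runs."""
    baseline = sum(history) / len(history)
    return today < baseline * (1 - drop_threshold)

history = [1000, 1040, 960]            # recent daily row counts
print(row_count_alert(history, 650))   # True: more than 30% below ~1000
print(row_count_alert(history, 980))   # False: within normal range
```

Tools like great_expectations and dbt tests wrap exactly this kind of check with reporting, scheduling, and documentation.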
8. Common pitfalls
Silent external API changes — when the response schema slips, loaded data shape breaks. Place schema validation right before load.
Inconsistent timezone assumptions — if external systems' times are unclear (KST or UTC?), aggregations drift. It is safer to normalize to UTC at load time.
Duplicate loads — when the same ID lands twice, counts and aggregates inflate. Block with natural key PK and upsert.
Reprocessing side effects — backfills mixed with real-time stats can show users temporary inconsistencies. A gate between load and serve (e.g. materialized view refresh) helps.
Scheduler single point of failure — if a single node's cron fails, the pipeline halts. Consider multiple workers plus a distributed lock or a managed scheduler.
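For the timezone pitfall in particular, normalizing at load time is a few lines with the standard library. A sketch assuming the source is known to send KST (Asia/Seoul, UTC+9):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Attach the source's timezone to the naive timestamp, then convert to
# UTC before load, so all downstream aggregation shares one clock.
def to_utc(naive: datetime, source_tz: str) -> datetime:
    return naive.replace(tzinfo=ZoneInfo(source_tz)).astimezone(timezone.utc)

kst_event = datetime(2024, 3, 1, 9, 0)   # "09:00" as sent by the source
print(to_utc(kst_event, "Asia/Seoul"))   # 2024-03-01 00:00:00+00:00
```

The key move is making the source timezone an explicit parameter instead of an unstated assumption.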
Closing thoughts
Splitting a data pipeline into steps makes it far easier to operate: when data goes wrong, we can trace it step by step. Tooling can start with cron plus a Python script. As long as raw records and reproducibility are kept, tools can be swapped later.
Next
- kafka-when
- pgvector-rag
References: Apache Airflow official, Dagster official, Prefect official, dbt official, Debezium official, Designing Data-Intensive Applications.