A data pipeline is an automated system that moves and transforms data from sources to destinations on a schedule or in real time. It encompasses ETL/ELT processes, stream processing, and data orchestration. Tools include Apache Airflow, Dagster, Prefect, and managed services like Fivetran.

How a Data Pipeline Works

A typical pipeline: every hour, Airflow triggers a DAG that extracts new orders from the production database, transforms them with dbt SQL models, loads aggregated metrics into the analytics warehouse, and sends a Slack notification if revenue drops below a threshold.
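The hourly run above can be sketched as plain Python. This is a minimal illustration of the extract → transform → load → alert sequence, not real Airflow, dbt, or Slack code: the function names, the sample orders, and the revenue threshold are all hypothetical stand-ins.

```python
from datetime import datetime, timezone

# Hypothetical threshold for illustration; in practice this would be
# configured in the orchestrator or an alerting rule.
REVENUE_THRESHOLD = 10_000.0

def extract_new_orders(since):
    """Stand-in for querying new orders from the production database."""
    return [
        {"order_id": 1, "amount": 4200.0},
        {"order_id": 2, "amount": 3100.0},
    ]

def transform(orders):
    """Stand-in for dbt SQL models: aggregate raw rows into metrics."""
    return {
        "order_count": len(orders),
        "revenue": sum(o["amount"] for o in orders),
    }

def load(metrics, warehouse):
    """Stand-in for loading aggregated metrics into the warehouse."""
    warehouse.append({"loaded_at": datetime.now(timezone.utc), **metrics})

def notify_if_low(metrics):
    """Stand-in for the Slack alert when revenue is below threshold."""
    if metrics["revenue"] < REVENUE_THRESHOLD:
        return f"ALERT: revenue {metrics['revenue']:.2f} below threshold"
    return None

# One scheduled run; the orchestrator would trigger this every hour.
warehouse = []
orders = extract_new_orders(since=datetime(2024, 1, 1, tzinfo=timezone.utc))
metrics = transform(orders)
load(metrics, warehouse)
alert = notify_if_low(metrics)
```

Each step consumes the previous step's output, which is exactly the dependency structure an orchestrator encodes as a DAG.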

Key Concepts

  • DAG (Directed Acyclic Graph) — Pipeline steps organized as a dependency graph — each step runs after its dependencies complete
  • Orchestration — Scheduling, monitoring, and managing pipeline execution — handling retries, failures, and dependencies
  • Stream vs Batch — Batch pipelines run on schedules (hourly/daily). Stream pipelines process events in real time (Kafka, Flink)

Frequently Asked Questions

When do I need a data pipeline?

When you need to regularly move, transform, or consolidate data from multiple sources. Manual CSV exports and one-off scripts don't scale — pipelines automate and monitor the process.