The Problem

Your analytics team needs data from dozens of sources, but building pipelines takes forever. Every ETL job feels like starting from scratch:

  • DAG structure boilerplate
  • IAM permissions for S3 access
  • Error handling and retries
  • Testing that never gets written

The AI Solution

AI generates the building blocks; you assemble and refine. Here’s how I built a working ETL pipeline in two days.

The Workflow

1. Requirements  → Source, transform, destination
2. AI Generation → DAG, Terraform, tests
3. Assembly      → Wire components together
4. Hardening     → Add error handling, retries
5. Testing       → Validate in staging

AI in Action: Example Prompts

Prompt 1: DAG Structure

"Create an Apache Airflow DAG that:
1. Downloads CSV from S3
2. Cleans data: drop NA, trim strings, parse dates
3. Adds processed_at timestamp
4. Uploads to S3 in different bucket
5. Validates output has rows
Use PythonOperator and S3 operators"

AI Output: Complete DAG from POC #3

import boto3
import pandas as pd
from datetime import datetime, timezone

s3_client = boto3.client('s3')

def extract_data(**context):
    s3_client.download_file(S3_BUCKET, INPUT_KEY, '/tmp/input.csv')
    return '/tmp/input.csv'

def transform_data(**context):
    # Path returned by extract_data, pulled via XCom
    input_file = context['ti'].xcom_pull(task_ids='extract_data')
    df = pd.read_csv(input_file).dropna()
    df['processed_at'] = datetime.now(timezone.utc)
    df.to_csv('/tmp/output.csv', index=False)
    return '/tmp/output.csv'

Manual Refinement: Add error handling, retries, logging.
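As a concrete illustration of that refinement, the cleaning logic can be pulled into a small, logged helper (`clean_dataframe` is a hypothetical name, not part of the generated DAG):

```python
import logging
from datetime import datetime, timezone

import pandas as pd

logger = logging.getLogger(__name__)

def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Cleaning steps from the prompt: drop NAs, trim strings, stamp rows."""
    df = df.dropna().copy()
    # Trim whitespace on every string column
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()
    df["processed_at"] = datetime.now(timezone.utc)
    logger.info("Cleaned frame: %d rows remain", len(df))
    return df
```

Calling a helper like this from `transform_data` keeps the Airflow task thin and the logic unit-testable.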

Prompt 2: Terraform for Airflow

"Write Terraform for:
- VPC with public subnet
- EC2 for Airflow
- Security group allowing 8080 (Airflow UI) and 22
- S3 bucket for data
- IAM role with S3 read/write"

AI Output: Basic infrastructure (POC #3)

Manual Refinement:

# Manually added versioning
resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration { status = "Enabled" }
}

# Manually added least-privilege IAM
policy = jsonencode({
  Version = "2012-10-17"
  Statement = [{
    Effect   = "Allow"
    Action   = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
    Resource = [bucket_arn, "${bucket_arn}/*"]
  }]
})

Prompt 3: CI/CD Pipeline

"GitLab CI with:
- Lint stage: check Python syntax
- Test stage: pytest with mocking
- Deploy staging on develop branch
- Deploy production on main with approval"

AI Output: Pipeline structure (POC #3)

Manual Refinement: Add manual approval for production.
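The "pytest with mocking" stage can exercise S3-dependent steps without AWS credentials. A minimal sketch, assuming an `upload_processed` helper (a hypothetical name, not the generated code):

```python
from unittest import mock

def upload_processed(s3_client, bucket, key, path):
    """Upload the processed CSV and return its S3 URI."""
    s3_client.upload_file(path, bucket, key)
    return f"s3://{bucket}/{key}"

def test_upload_processed():
    # Mock the boto3 client so the test makes no network calls
    fake_s3 = mock.Mock()
    uri = upload_processed(fake_s3, "processed-bucket", "out.csv", "/tmp/out.csv")
    fake_s3.upload_file.assert_called_once_with("/tmp/out.csv", "processed-bucket", "out.csv")
    assert uri == "s3://processed-bucket/out.csv"
```

Tests like this run in the CI test stage in milliseconds, so every merge request gets coverage for free.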

The Complete Pipeline

[CSV in S3] → [Airflow EC2] → [Transform] → [Processed CSV in S3]
                  ↓
           [Validate rows > 0]
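The validation step above can be a small callable that fails the run when the output is empty (names are illustrative):

```python
import pandas as pd

def validate_output(path) -> int:
    """Raise if the processed CSV has no data rows; return the row count.

    `path` can be a filesystem path or any file-like object pandas accepts.
    """
    df = pd.read_csv(path)
    if len(df) == 0:
        raise ValueError(f"Validation failed: output has no rows ({path!r})")
    return len(df)
```

In Airflow, an exception raised inside a PythonOperator callable marks the task failed, which triggers the configured retries and alerts.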

Cost & Performance Notes

Component         Cost                Notes
t3.small EC2      ~$30/month          Airflow scheduler + workers
S3 storage        ~$0.023/GB/month    S3 Standard, pay per use
Development time  ~2 days             AI-assisted

What This Means for Your Organization

  • Faster Data Products: Pipelines in days, not weeks
  • Maintainable: CI/CD ensures reliability
  • Secure: IAM follows least privilege
  • Scalable: Add new sources without rebuilding

Need to get data flowing? Let’s discuss your pipeline requirements.