From CSV to S3: Building an Airflow Data Pipeline with AI-Generated Building Blocks
The Problem
Your analytics team needs data from dozens of sources, but building pipelines takes forever. Every ETL job feels like starting from scratch:
- DAG structure boilerplate
- IAM permissions for S3 access
- Error handling and retries
- Testing that never gets written
The AI Solution
AI generates the building blocks; you assemble and refine. Here's how I built an ETL pipeline in two days.
The Workflow
1. Requirements → Source, transform, destination
2. AI Generation → DAG, Terraform, tests
3. Assembly → Wire components together
4. Hardening → Add error handling, retries
5. Testing → Validate in staging
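Before wiring anything into Airflow, the extract → transform → validate flow can be sketched as plain Python. This is a minimal, dependency-free sketch of the pipeline logic; the function and file names are illustrative, not the actual POC code:

```python
import csv
from datetime import datetime, timezone

def extract(path):
    # Stand-in for the S3 download step: read raw rows from a local CSV
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Drop rows with missing values, trim strings, stamp processed_at
    cleaned = []
    ts = datetime.now(timezone.utc).isoformat()
    for row in rows:
        if any(v is None or v.strip() == "" for v in row.values()):
            continue
        out = {k: v.strip() for k, v in row.items()}
        out["processed_at"] = ts
        cleaned.append(out)
    return cleaned

def validate(rows):
    # Fail the run if the transform produced no output
    if not rows:
        raise ValueError("transform produced no rows")
    return rows
```

Once each stage works in isolation, the assembly step is mostly about moving these bodies into Airflow task callables.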
AI in Action: Example Prompts
Prompt 1: DAG Structure
"Create an Apache Airflow DAG that:
1. Downloads CSV from S3
2. Cleans data: drop NA, trim strings, parse dates
3. Adds processed_at timestamp
4. Uploads to S3 in different bucket
5. Validates output has rows
Use PythonOperator and S3 operators"
AI Output: Complete DAG from POC #3
import boto3
import pandas as pd
from datetime import datetime

# S3_BUCKET, INPUT_KEY, and OUTPUT_KEY are defined in the DAG's config
s3_client = boto3.client("s3")

def extract_data(**context):
    # Download the raw CSV from S3 to local disk
    s3_client.download_file(S3_BUCKET, INPUT_KEY, '/tmp/input.csv')
    return '/tmp/input.csv'

def transform_data(**context):
    # Pull the extracted file path from the upstream task
    input_file = context['ti'].xcom_pull(task_ids='extract_data')
    df = pd.read_csv(input_file)
    df = df.dropna()
    df['processed_at'] = datetime.now()
    df.to_csv('/tmp/output.csv', index=False)
Manual Refinement: Add error handling, retries, logging.
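The retry refinement is mostly configuration: Airflow applies `default_args` to every task in the DAG, so retries and backoff live in one place. A minimal sketch, with illustrative values:

```python
from datetime import timedelta

# Passed as DAG(..., default_args=default_args); applies per task
default_args = {
    "owner": "data-eng",
    "retries": 3,                         # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
    "retry_exponential_backoff": True,    # back off 5m, 10m, 20m, ...
    "email_on_failure": False,            # alerting handled elsewhere
}
```

Error handling and logging still need to go inside each callable; `default_args` only controls what happens after a task fails.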
Prompt 2: Terraform for Airflow
"Write Terraform for:
- VPC with public subnet
- EC2 for Airflow
- Security group allowing 8080 (Airflow UI) and 22
- S3 bucket for data
- IAM role with S3 read/write"
AI Output: Basic infrastructure (POC #3)
Manual Refinement:
# Manually added versioning
resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration {
    status = "Enabled"
  }
}
# Manually added least-privilege IAM
policy = jsonencode({
  Version = "2012-10-17"
  Statement = [{
    Effect   = "Allow"
    Action   = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
    Resource = [bucket_arn, "${bucket_arn}/*"]
  }]
})
Prompt 3: CI/CD Pipeline
"GitLab CI with:
- Lint stage: check Python syntax
- Test stage: pytest with mocking
- Deploy staging on develop branch
- Deploy production on main with approval"
AI Output: Pipeline structure (POC #3)
Manual Refinement: Mark the production deploy job `when: manual` so it waits for explicit approval in the GitLab UI.
The Complete Pipeline
[CSV in S3] → [Airflow EC2] → [Transform] → [Processed CSV in S3]
                                                      ↓
                                            [Validate rows > 0]
Cost & Performance Notes
| Component | Cost | Notes |
|---|---|---|
| t3.small EC2 | ~$30/month | Airflow + workers |
| S3 Storage | ~$0.023/GB/month | Pay per use |
| Development time | 2 days | AI-assisted |
What This Means for Your Organization
- Faster Data Products: Pipelines in days, not weeks
- Maintainable: CI/CD ensures reliability
- Secure: IAM follows least privilege
- Scalable: Add new sources without rebuilding
Need to get data flowing? Let’s discuss your pipeline requirements.