Engineering Guide

How to Start Building a Custom ETL in Scala

Scala is a strong ETL choice when you want Spark-native development, type safety, and an implementation that can grow from a single batch job into a larger data platform.

Build a Custom ETL View ETL Hub

When Scala Makes Sense for ETL

Scala is most valuable when your ETL logic is complex enough that strong abstractions, reusable libraries, and compile-time checks save you real maintenance effort. It is especially attractive when your runtime is Apache Spark, because Spark's core APIs are native to the JVM and first-class in Scala.

Use Scala when you expect the pipeline codebase to behave like software, not just like a collection of scripts.

Start with the Right Shape

Core layers

- Configuration: input paths, secrets, environment flags, target tables
- Readers: JDBC, files, message queues, or API ingestion adapters
- Transforms: normalization, enrichment, deduplication, and business rules
- Validators: schema checks, row-count checks, and domain assertions
- Writers: lake, warehouse, or service-specific output adapters

Early design goals

- Idempotent loads so reruns do not duplicate data
- Structured logging with batch or partition identifiers
- Clear failure boundaries between extract, transform, and load stages
- Small, testable transformations instead of giant monolithic jobs
- A deployment model that matches the runtime from day one

Project Scaffold

A minimal Spark-based ETL project can start with build.sbt, a single main class, and a conventional src/main/scala layout. Replace the example Spark and Scala versions below so they match your target cluster runtime.

name := "custom-etl"

version := "0.1.0"

scalaVersion := "2.13.17"

val sparkVersion = "4.1.2" // Replace to match your target cluster

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "com.typesafe" % "config" % "1.4.3",
  "org.scalatest" %% "scalatest" % "3.2.19" % Test
)

A Minimal Job Skeleton

Even a small starter job should accept input and output locations as arguments, construct a SparkSession, apply a narrow transformation chain, and write to a deterministic target format such as Parquet.

package dev.ryware.etl

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

object CustomEtlJob {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)
    val outputPath = args(1)

    val spark = SparkSession.builder()
      .appName("custom-etl-job")
      .getOrCreate()

    val curated = spark.read
      .option("header", "true")
      .csv(inputPath)
      .filter(col("transaction_id").isNotNull)
      .withColumn("amount", col("amount").cast("decimal(18,2)"))
      .withColumn("updated_at", to_timestamp(col("updated_at")))

    curated.write
      .mode("overwrite")
      .parquet(outputPath)

    spark.stop()
  }
}

Production Practices to Add Next

Validation and quality

- Add schema assertions before writing curated outputs
- Store row counts and null rates per run for anomaly detection
- Reject or quarantine malformed partitions instead of silently coercing everything

Testing and packaging

- Unit test transformation functions with small in-memory DataFrames
- Add integration tests for source and sink adapters
- Package with sbt and run with spark-submit or the cluster's native launcher

A Sensible First Milestone

Your first goal should not be "build the whole data platform." It should be: read one source reliably, apply one clean transformation path, write one curated output, and make the run observable. Once that path is repeatable, everything else becomes safer to add.

Milestone 1

Single batch job with deterministic outputs

Milestone 2

Configuration-driven environments and validation rules

Milestone 3

Orchestration, alerts, lineage, and quality baselines

Related ETL Guides

Continue with platform selection, anomaly detection, or AWS Glue-specific implementation options.

Back to ETL hub

Platform Comparison