Engineering Guide

How to Start Building a Custom ETL in Scala

Scala is a strong ETL choice when you want Spark-native development, type safety, and an implementation that can grow from a single batch job into a larger data platform.

When Scala Makes Sense for ETL

Scala is most valuable when your ETL logic is complex enough that strong abstractions, reusable libraries, and compile-time checks save you real maintenance effort. It is especially attractive when your runtime is Apache Spark, because Spark's core APIs are native to the JVM and first-class in Scala.

Use Scala when you expect the pipeline codebase to behave like software, not just like a collection of scripts.

Start with the Right Shape

Core layers

  • - Configuration: input paths, secrets, environment flags, target tables
  • - Readers: JDBC, files, message queues, or API ingestion adapters
  • - Transforms: normalization, enrichment, deduplication, and business rules
  • - Validators: schema checks, row-count checks, and domain assertions
  • - Writers: lake, warehouse, or service-specific output adapters

Early design goals

  • - Idempotent loads so reruns do not duplicate data
  • - Structured logging with batch or partition identifiers
  • - Clear failure boundaries between extract, transform, and load stages
  • - Small, testable transformations instead of giant monolithic jobs
  • - A deployment model that matches the runtime from day one

Project Scaffold

A minimal Spark-based ETL project can start with build.sbt, a single main class, and a conventional src/main/scala layout. Replace the example Spark and Scala versions below so they match your target cluster runtime.

name := "custom-etl"

version := "0.1.0"

scalaVersion := "2.13.17"

val sparkVersion = "4.1.2" // Replace to match your target cluster

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "com.typesafe" % "config" % "1.4.3",
  "org.scalatest" %% "scalatest" % "3.2.19" % Test
)

A Minimal Job Skeleton

Even a small starter job should accept input and output locations as arguments, construct a SparkSession, apply a narrow transformation chain, and write to a deterministic target format such as Parquet.

package dev.ryware.etl

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

object CustomEtlJob {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)
    val outputPath = args(1)

    val spark = SparkSession.builder()
      .appName("custom-etl-job")
      .getOrCreate()

    val curated = spark.read
      .option("header", "true")
      .csv(inputPath)
      .filter(col("transaction_id").isNotNull)
      .withColumn("amount", col("amount").cast("decimal(18,2)"))
      .withColumn("updated_at", to_timestamp(col("updated_at")))

    curated.write
      .mode("overwrite")
      .parquet(outputPath)

    spark.stop()
  }
}

Production Practices to Add Next

Validation and quality

  • - Add schema assertions before writing curated outputs
  • - Store row counts and null rates per run for anomaly detection
  • - Reject or quarantine malformed partitions instead of silently coercing everything

Testing and packaging

  • - Unit test transformation functions with small in-memory DataFrames
  • - Add integration tests for source and sink adapters
  • - Package with sbt and run with spark-submit or the cluster's native launcher

A Sensible First Milestone

Your first goal should not be "build the whole data platform." It should be: read one source reliably, apply one clean transformation path, write one curated output, and make the run observable. Once that path is repeatable, everything else becomes safer to add.

Milestone 1

Single batch job with deterministic outputs

Milestone 2

Configuration-driven environments and validation rules

Milestone 3

Orchestration, alerts, lineage, and quality baselines

© 2026 - Ryware.