Unit 2 - Notes

INT332

Unit 2: Image Building & Container Management

1. Dockerfile Core Concepts

A Dockerfile is a text document containing all the commands a user could call on the command line to assemble an image. Docker reads these instructions to build images automatically.

Image Layering

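Definition: A Docker image is built up from a series of layers. Each filesystem-changing instruction in the image's Dockerfile (RUN, COPY, ADD) produces a layer; other instructions, such as ENV or CMD, only add metadata to the image configuration.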
Read-Only Layers: All layers making up the image are read-only. When a container is launched from the image, Docker adds a thin read-write layer on top of the image layers.
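Caching: Docker caches these layers during the build process. If an instruction and its inputs are unchanged (for COPY and ADD, the contents of the copied files are compared as well), Docker reuses the cached layer, which speeds up the build significantly. Once one step misses the cache, every step after it is rebuilt, so instructions that change often belong near the end of the Dockerfile.
Union File System (UnionFS): Docker uses a union filesystem (on Linux, typically OverlayFS through the overlay2 storage driver) to present these separate layers as a single cohesive file system.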
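The caching behaviour can be pictured with a small toy model. This is only a sketch, not Docker's actual cache implementation: it assumes each step's cache key is a hash of the previous key, the instruction text, and (for COPY) the copied file contents. The function names layer_keys and cached_steps are made up for the illustration.

```python
import hashlib

def layer_keys(instructions, file_contents=None):
    """Return one cache key per instruction; each key also depends on the previous key."""
    file_contents = file_contents or {}
    keys = []
    parent = ""
    for instruction in instructions:
        h = hashlib.sha256()
        h.update(parent.encode())       # chaining: a change here affects every later key
        h.update(instruction.encode())  # the instruction text itself
        # Toy stand-in for "contents of the copied files", looked up by instruction text.
        h.update(file_contents.get(instruction, "").encode())
        parent = h.hexdigest()
        keys.append(parent)
    return keys

def cached_steps(old_keys, new_keys):
    """Count the leading steps whose keys match, i.e. steps that could come from cache."""
    count = 0
    for old, new in zip(old_keys, new_keys):
        if old != new:
            break
        count += 1
    return count

dockerfile = [
    "FROM python:3.9-slim",
    "COPY requirements.txt .",
    "RUN pip install -r requirements.txt",
    "COPY . .",
]
first = layer_keys(dockerfile, {"COPY . .": "print('v1')"})
second = layer_keys(dockerfile, {"COPY . .": "print('v2')"})
print(cached_steps(first, second))  # 3: only the final COPY step is rebuilt
```

In this model, editing application code invalidates only the last step, while editing requirements.txt would invalidate the second step and everything after it.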

Build Context & `.dockerignore`

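Build Context: When you run docker build, the Docker CLI sends the contents of the specified directory (the "context"), including all files and subdirectories, to the Docker daemon. The classic builder transfers the whole directory up front; BuildKit transfers only the files the build actually uses.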
.dockerignore File: To prevent sending unnecessary files (like .git, node_modules, or secret environment files) to the daemon, you use a .dockerignore file.
- Benefits: Increases build speed, reduces image size, and prevents accidental leakage of sensitive files into the image.
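To make the effect of ignore patterns concrete, here is a simplified sketch that filters a list of context paths. It is an approximation only: real .dockerignore matching has additional rules (for example `**` wildcards and `!` exceptions) that this toy version ignores, and its `*` also matches across directory separators. The function name filter_context and the sample file list are invented for the example.

```python
from fnmatch import fnmatchcase

def filter_context(paths, ignore_patterns):
    """Return the paths that would still be sent, using simplified pattern matching."""
    kept = []
    for path in paths:
        parts = path.split("/")
        # A path is ignored if it, or any of its parent directories, matches a pattern.
        prefixes = ["/".join(parts[: i + 1]) for i in range(len(parts))]
        ignored = any(
            fnmatchcase(prefix, pattern)
            for prefix in prefixes
            for pattern in ignore_patterns
        )
        if not ignored:
            kept.append(path)
    return kept

context = [
    "app.py",
    "requirements.txt",
    ".env",
    ".git/config",
    "node_modules/lodash/index.js",
    "debug.log",
]
patterns = [".git", "node_modules", ".env", "*.log"]
print(filter_context(context, patterns))  # ['app.py', 'requirements.txt']
```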

Basic Dockerfile Instructions

FROM: Initializes a new build stage and sets the base image. It must be the first instruction in the Dockerfile; only comments, parser directives, and ARG may come before it (e.g., FROM ubuntu:20.04).
WORKDIR: Sets the working directory for any subsequent RUN, CMD, ENTRYPOINT, COPY, and ADD instructions.
COPY: Copies files or directories from the host's build context into the container's filesystem. (e.g., COPY . /app).
ADD: Similar to COPY but has additional features: it can extract local tar files automatically and can download files from URLs. Best practice is to use COPY unless these specific features are needed.
RUN: Executes commands in a new layer and commits the results. Used for installing packages and setting up the environment (e.g., RUN apt-get update && apt-get install -y python3).
ENV: Sets environment variables that persist in the final image and running containers (e.g., ENV PORT=8080).
EXPOSE: Documents which ports are intended to be published. Note: It does not actually publish the port; it acts as documentation between the image builder and the container runner.
VOLUME: Creates a mount point and marks it as holding externally mounted volumes. Useful for persisting data.
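CMD: Provides the default command, or the default arguments to ENTRYPOINT, for a running container. It can be overridden by arguments given after the image name in docker run. If a Dockerfile contains several CMD instructions, only the last one takes effect.
ENTRYPOINT: Configures a container that will run as an executable. Unlike CMD, it is not replaced by arguments passed to docker run; with the exec form, those arguments are appended to the ENTRYPOINT command. Replacing it requires the explicit --entrypoint flag.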
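The interaction between CMD, ENTRYPOINT, and the arguments passed to docker run can be sketched as a small function. This is a simplified model that assumes the exec (JSON array) form for both instructions and does not cover the --entrypoint flag; container_command is a hypothetical helper name, not part of Docker.

```python
def container_command(entrypoint, cmd, run_args):
    """Combine exec-form ENTRYPOINT and CMD with the arguments given after the image name."""
    # Arguments passed to `docker run <image> ...` replace CMD entirely.
    effective_cmd = list(run_args) if run_args else list(cmd)
    # ENTRYPOINT, when present, stays in front; CMD or the run arguments follow it.
    return list(entrypoint) + effective_cmd

# Image with only CMD: the run arguments replace the whole command.
print(container_command([], ["flask", "run"], ["python", "--version"]))
# ['python', '--version']

# Image with ENTRYPOINT and CMD: the run arguments replace only the default arguments.
print(container_command(["ping"], ["localhost"], ["example.com"]))
# ['ping', 'example.com']

# No run arguments: CMD supplies the defaults.
print(container_command(["ping"], ["localhost"], []))
# ['ping', 'localhost']
```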

Example Dockerfile:

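```dockerfile
# Small official Python base image
FROM python:3.9-slim
WORKDIR /app
# Copy the dependency list first so the install layer stays cached until it changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application source last, since it changes most often
COPY . .
ENV FLASK_APP=app.py
EXPOSE 5000
CMD ["flask", "run", "--host=0.0.0.0"]
```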

2. Image Creation in Detail

The `docker build` Process

CLI to Daemon: The Docker CLI parses the docker build command and sends the build context to the Docker daemon.
Step-by-Step Execution: The daemon executes the Dockerfile instructions one by one.
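Intermediate Containers: With the classic (legacy) builder, Docker creates a temporary container for each instruction, runs the instruction, saves the result as a new image layer, and then removes the temporary container. BuildKit, the default builder since Docker Engine 23.0, executes the steps differently but still produces a layer for each filesystem-changing instruction.
Final Image: Once all steps are completed, the resulting stack of layers, together with the image configuration metadata, is the built image.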

Image Tagging / Versioning

Concept: Tags are human-readable aliases for image IDs, used to version images.
Format: repository:tag (e.g., nginx:1.21.6). If no tag is specified, Docker defaults to latest.
Command: docker tag <source_image> <target_image>
- Example: docker tag myapp:latest myregistry.com/myapp:v1.0.0
Best Practices: Avoid relying solely on the latest tag in production. It is only the default tag name and does not necessarily point to the newest build, which makes deployments unpredictable. Prefer explicit tags such as semantic versions or commit hashes.
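The repository:tag format and the latest default can be illustrated with a short parser. It is a simplified sketch: it does not handle digest references (name@sha256:...), and split_reference is a made-up name rather than anything Docker provides.

```python
def split_reference(reference):
    """Split an image reference into (repository, tag), defaulting the tag to 'latest'."""
    last_slash = reference.rfind("/")
    last_colon = reference.rfind(":")
    # A colon before the last slash belongs to a registry port
    # (e.g. localhost:5000/myapp), not to a tag.
    if last_colon > last_slash:
        return reference[:last_colon], reference[last_colon + 1:]
    return reference, "latest"

print(split_reference("nginx:1.21.6"))                 # ('nginx', '1.21.6')
print(split_reference("nginx"))                        # ('nginx', 'latest')
print(split_reference("myregistry.com/myapp:v1.0.0"))  # ('myregistry.com/myapp', 'v1.0.0')
print(split_reference("localhost:5000/myapp"))         # ('localhost:5000/myapp', 'latest')
```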

Inspecting Images

docker inspect <image_name>: Returns a detailed JSON array containing all metadata about the image, including architecture, OS, size, environment variables, entrypoints, and the layers it comprises.
docker history <image_name>: Shows the history of the image, detailing the size and the exact Dockerfile instruction that created each layer. Very useful for debugging large image sizes.
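Because docker inspect prints JSON, its output is easy to post-process in a script. The sketch below parses a trimmed, illustrative sample; every value in it is a placeholder rather than real output, and the summarize helper is hypothetical. Real output contains many more fields.

```python
import json

# Trimmed, illustrative sample of `docker inspect` output; all values are placeholders.
sample_output = """
[
  {
    "Id": "sha256:<image-id>",
    "RepoTags": ["myapp:latest"],
    "Architecture": "amd64",
    "Os": "linux",
    "Config": {
      "Env": ["PATH=/usr/local/bin:/usr/bin", "FLASK_APP=app.py"],
      "Cmd": ["flask", "run", "--host=0.0.0.0"],
      "Entrypoint": null
    },
    "RootFS": {
      "Type": "layers",
      "Layers": ["sha256:<layer-1>", "sha256:<layer-2>", "sha256:<layer-3>"]
    }
  }
]
"""

def summarize(inspect_json):
    """Pull a few commonly needed fields out of docker inspect output."""
    image = json.loads(inspect_json)[0]  # the output is a JSON array
    config = image["Config"]
    return {
        "platform": f'{image["Os"]}/{image["Architecture"]}',
        "env": dict(item.split("=", 1) for item in config["Env"]),
        "command": (config["Entrypoint"] or []) + (config["Cmd"] or []),
        "layer_count": len(image["RootFS"]["Layers"]),
    }

summary = summarize(sample_output)
print(summary["platform"])     # linux/amd64
print(summary["layer_count"])  # 3
```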

3. Docker Networking

Docker handles communication between containers and the outside world through its networking subsystem.

Bridge Network

Default Bridge: By default, containers are connected to the default bridge network. Containers on this network can reach each other only by IP address; there is no automatic name resolution.
User-Defined Bridge: It is a best practice to create custom bridge networks (docker network create my-net). Containers on user-defined bridges can resolve each other by container name or alias, providing automatic DNS resolution.

Host & Overlay Networks

Host Network: (--network host) Removes network isolation between the container and the Docker host. The container shares the host's networking namespace, meaning it uses the host's IP and ports directly.
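Overlay Network: Used in Docker Swarm or multi-host environments. It allows containers running on different Docker hosts to communicate as if they were on the same local network. Application traffic on an overlay network is encrypted only if the network is created with the encrypted option (docker network create --driver overlay --opt encrypted my-overlay).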

DNS Inside Docker

Docker provides an embedded DNS server, reachable inside containers on user-defined networks at 127.0.0.11.
For containers on user-defined networks, this DNS server automatically resolves container names and network aliases to their respective IP addresses.
External DNS requests are forwarded to the DNS servers configured on the Docker host.

Linking Containers

Legacy --link: An older mechanism to allow containers to discover each other securely and pass environment variables.
Modern Approach: --link is deprecated. The modern, robust way to link containers is to place them on the same user-defined bridge network, allowing them to communicate via standard DNS resolution using container names.

Port Mapping

By default, a container's ports are not published: they are not reachable from outside the Docker host until you map them.
Syntax: -p <Host_Port>:<Container_Port> (e.g., docker run -p 8080:80 nginx)
On Linux, Docker implements this with NAT rules (iptables) that forward traffic arriving on port 8080 of the host machine to port 80 inside the container.
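The -p value has a few common shapes beyond host:container. The sketch below parses the simple forms into their parts; it is a simplified illustration that does not handle port ranges or IPv6 addresses, and parse_publish is an invented helper name. A host port of None stands for the case where only the container port is given and Docker picks a free host port.

```python
def parse_publish(spec):
    """Parse a simplified -p value: [host_ip:][host_port:]container_port[/protocol]."""
    port_part, _, protocol = spec.partition("/")
    fields = port_part.split(":")
    if len(fields) == 1:
        host_ip, host_port, container_port = "", "", fields[0]
    elif len(fields) == 2:
        host_ip = ""
        host_port, container_port = fields
    elif len(fields) == 3:
        host_ip, host_port, container_port = fields
    else:
        raise ValueError(f"unsupported publish spec: {spec!r}")
    return {
        "host_ip": host_ip or "0.0.0.0",  # no address given: listen on all host interfaces
        "host_port": int(host_port) if host_port else None,
        "container_port": int(container_port),
        "protocol": protocol or "tcp",  # TCP is the default protocol
    }

print(parse_publish("8080:80"))
# {'host_ip': '0.0.0.0', 'host_port': 8080, 'container_port': 80, 'protocol': 'tcp'}
print(parse_publish("127.0.0.1:8080:80/udp"))
# {'host_ip': '127.0.0.1', 'host_port': 8080, 'container_port': 80, 'protocol': 'udp'}
```

Binding to 127.0.0.1, as in the second call, keeps the published port reachable only from the host itself.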

4. Docker Storage

Containers are ephemeral; when a container is deleted, its writable layer is lost. Docker provides mechanisms to persist data.

Volumes vs Bind Mounts

Volumes:
- Managed entirely by Docker.
- Stored in a part of the host filesystem which is managed by Docker (/var/lib/docker/volumes/ on Linux).
- Isolated from the core functionalities of the host machine.
- Best for: Persisting database data, sharing data between containers, and backups.
Bind Mounts:
- Dependent on the directory structure of the host machine.
- Mounts a specific file or directory from the host into the container.
- Best for: Development environments (e.g., mounting source code into a container so changes on the host immediately reflect inside the container).

Backing Data on Host

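When a container's data lives in a volume or bind mount on the host rather than in its writable layer, the lifecycle of the data is independent of the container lifecycle: the data survives when the container is removed.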

Volume creation: docker volume create my-vol
Usage: docker run -v my-vol:/app/data my-image

Copy-on-Write (CoW) Mechanism

How it works: When a container needs to modify a file that exists in an underlying read-only image layer, Docker uses CoW. It copies the file from the read-only layer up into the container's thin writable layer, and then makes the modification there.
Benefit: This maximizes storage efficiency and minimizes container startup time, as images can be shared across multiple containers without duplication, and only modified files take up extra disk space.
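A toy model of copy-on-write, using Python's ChainMap to stand in for the layered filesystem, may help. It is only an analogy under simplifying assumptions: files are dictionary entries, and a write stores the new content directly in the top layer, whereas a real storage driver first copies the whole file up and then modifies it. The names start_container, base_layer, and app_layer are illustrative.

```python
from collections import ChainMap

# Read-only image layers, lowest first (toy model: path -> content).
base_layer = {"/etc/os-release": "debian", "/app/config.ini": "debug=false"}
app_layer = {"/app/main.py": "print('hello')"}

def start_container(*image_layers):
    """Stack a fresh, empty writable layer on top of the shared image layers."""
    writable = {}
    # ChainMap searches its maps from left to right, so upper layers come first.
    return ChainMap(writable, *reversed(image_layers))

c1 = start_container(base_layer, app_layer)
c2 = start_container(base_layer, app_layer)

# Modifying a file in c1 touches only c1's writable layer.
c1["/app/config.ini"] = "debug=true"

print(c1["/app/config.ini"])          # debug=true
print(c2["/app/config.ini"])          # debug=false
print(base_layer["/app/config.ini"])  # debug=false
print(c1.maps[0])                     # {'/app/config.ini': 'debug=true'}
```

Both containers share the same image layers, and only the one modified file occupies space in the first container's writable layer.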

5. Registries

A Docker registry is a stateless, highly scalable server-side application that stores and lets you distribute Docker images.

Docker Hub

The default, public registry maintained by Docker.
Contains millions of official and community-contributed images.
When you run docker pull ubuntu, Docker defaults to downloading from Docker Hub.

GitHub Container Registry (GHCR)

Integrated tightly with GitHub Actions and GitHub Packages.
Uses the domain ghcr.io.
Provides granular permissions tied to GitHub repositories and user accounts, making it excellent for CI/CD pipelines hosted on GitHub.

Private Registries

Organizations often host their own private registries to keep proprietary images secure and to reduce bandwidth usage.
Can be hosted using cloud providers (Amazon ECR, Azure Container Registry, Google Artifact Registry, the successor to GCR), third-party tools (Harbor, JFrog Artifactory), or by running the official open-source registry image locally.

Authentication & Access Tokens

Authentication: To push or pull from private registries, users must authenticate using docker login <registry-url>.
Access Tokens (PATs): Instead of using account passwords, it is highly recommended (and often required by platforms like GitHub and Docker Hub) to use Personal Access Tokens.
- Security: Tokens can be scoped with specific permissions (e.g., read-only or read-write) and can be easily revoked if compromised.
- CI/CD: In automated pipelines, these tokens are stored as encrypted secrets and passed to docker login (ideally on standard input via --password-stdin, so the token does not appear in the command line) to authorize image pushing/pulling during the build process.
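A pipeline step that logs in with a token might look like the following sketch. It assumes the token is exposed to the job in an environment variable; the variable name REGISTRY_TOKEN, the user name octocat, and both helper functions are hypothetical. Only the command list is built when the snippet runs; docker_login is defined but not called here.

```python
import os
import subprocess

def build_login_command(registry, username):
    """Build the argv for a non-interactive login that reads the token from stdin."""
    return ["docker", "login", "--username", username, "--password-stdin", registry]

def docker_login(registry, username, token_env="REGISTRY_TOKEN"):
    """Run docker login, feeding the token from an environment variable via stdin."""
    token = os.environ[token_env]  # REGISTRY_TOKEN is a hypothetical secret name
    # Passing the token on stdin keeps it out of the process list and shell history.
    return subprocess.run(
        build_login_command(registry, username),
        input=token.encode(),
        check=True,
    )

print(build_login_command("ghcr.io", "octocat"))
# ['docker', 'login', '--username', 'octocat', '--password-stdin', 'ghcr.io']
```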
