Docker (and HPC)

Peter van Heusden (pvh@sanbi.ac.za) and
Eugene de Beste, SANBI

"The trick is to build a fast system" - Seymour Cray

Reminder: High Performance Computing means optimising the whole computing system.

Remember Amdahl's Law: the theoretical speedup of a task is always limited by the part of the task that cannot benefit from the improvement.
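
As a rough worked example, Amdahl's Law can be written as S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the run time that benefits from the improvement and n is the speedup of that fraction (for example, the number of cores). The sketch below evaluates the formula with awk; the function name amdahl_speedup is made up for this illustration.

```shell
# Amdahl's Law: overall speedup when a fraction p of the work is sped up n times.
# amdahl_speedup is a hypothetical helper, not part of any standard tool.
amdahl_speedup() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) + p / n) }'
}

amdahl_speedup 0.9 16     # prints 6.40
amdahl_speedup 0.9 1024   # prints 9.91
```

Even with 90% of the work parallelised, 16 cores give a speedup of only 6.4, and no number of cores can exceed 1 / (1 - 0.9) = 10.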

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." - Donald Knuth

Variant calling pipeline. Image source: https://github.com/common-workflow-language/workflows/tree/h3abionet-gatk-workflow/workflows/GATK

Solve it in a way that is portable and that a handful of developers can adapt as the software changes

					
class: Workflow
cwlVersion: v1.0

inputs:
  reference:
    type: File
    doc: reference human genome file

steps:

  create-dict:
    run: ../../tools/picard-CreateSequenceDictionary.cwl
    in:
      reference: reference
      outputFileName: output_RefDictionaryFile
      tmpdir: tmpdir
    out: [ createDict_output ]
					
				

From CWL GATK workflow

  • Virtualisation is a key tool in modern computing infrastructure for:
    • Scalability
    • Density (high utilisation)
    • Security
  • Hypervisor-based virtualisation achieves this at the cost of performance
Hypervisors vs Containers, original source unknown
  • Docker container isolation is based on Linux control groups (cgroups) and namespaces
  • Access to hardware is, in general, through the same kernel interfaces as normal userspace code
  • Containers benefit from technology (e.g. Open vSwitch) designed to expose hardware virtualisation support
  • Device nodes can be exposed to a container (but: security?)
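
As a small illustration of the namespace mechanism: on Linux, each process's namespace membership is visible under /proc/<pid>/ns. The sketch below lists the namespaces of the current shell. show_namespaces is a made-up helper name, and the exact set of namespaces listed depends on the kernel version.

```shell
# Print each namespace type and identifier for the current process (Linux only).
# show_namespaces is a hypothetical helper name used for this illustration.
show_namespaces() {
  for ns in /proc/self/ns/*; do
    # Each entry is a symlink whose target looks like "pid:[4026531836]"
    printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
  done
}

show_namespaces
```

Running the same listing on the host and inside a container (for example with docker run --rm alpine ls -l /proc/self/ns) should show different identifiers for the namespaces Docker isolates, while cat /proc/self/cgroup shows the control group the process has been placed in.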

pvh@gabber:~$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world

c04b14da8d14: Pull complete
Digest: sha256:0256e8a36e2070f7bf2d0b0763dbabdd67798512411de4cdcf9431a1feb60fd9
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
				

Things to explore:

  • Where are my processes?
  • Docker networking
  • Linux images: ubuntu and alpine
  • Interactive containers: docker run -it
  • Containers are created from images, which are themselves built from specifications called Dockerfiles.
  • The Dockerfile contains commands and other instructions that describe how the image is to be constructed.
  • Each image is itself built on top of another base image, resulting in a layered approach that facilitates re-use.
  • Images can be hosted on Docker Hub or Quay.io
					
FROM quay.io/refgenomics/docker-ubuntu:14.04

MAINTAINER Nik Krumm <nkrumm@gmail.com>

RUN git clone https://github.com/lh3/bwa && \
	cd bwa && \
	git checkout 0.7.10 &&  \
	make && cp bwa /usr/local/bin/bwa

RUN apt-get install -y samtools

# Convenience commands
ADD align.py /usr/local/bin/align.py
RUN chmod +x /usr/local/bin/align.py
RUN ln -s /usr/local/bin/align.py /usr/local/bin/align
CMD ["/usr/local/bin/align"]
					
				

From: onecodex/docker-bwa

Docker images and repositories

Docker images are built using the docker build command, which takes the path of the directory containing the Dockerfile. Images can optionally be pushed to Docker Hub using the docker push command.

					
$ docker build -t pvanheus/aligntool:latest .

$ docker push pvanheus/aligntool:latest
					
				

The Quay.io registry

Docker Hub is run by Docker, Inc., but alternative registries exist, most notably quay.io. Quay is notable for having a powerful API, which allows it to be integrated into automated workflows.

It also supports GitHub integration.

Containers beyond Docker

  • Docker is by far the most popular container solution, however...
  • Concerns exist about Docker security and privilege escalation to root
  • Multi-user environments like HPC clusters are loath to install Docker
  • Thus: Singularity (and rkt, originally named Rocket)

Singularity

  • Singularity emerged from Lawrence Berkeley National Laboratory
  • Targeted at the HPC community
  • Images are file-based
  • Can build images from scratch or import them from Docker
  • Minimises use of root
  • Runs as the invoking user: no privileged operations
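
A minimal sketch of importing a Docker image, assuming Singularity 3.x or later is installed (earlier releases used different subcommands). The image alpine:latest and the file name alpine.sif are only examples.

```shell
# Illustrative names: any Docker Hub image and output file name would do.
image_uri="docker://alpine:latest"
sif_file="alpine.sif"

if command -v singularity >/dev/null 2>&1; then
  # Convert the Docker image into a single file-based image
  singularity build "$sif_file" "$image_uri"
  # Run a command inside the image and print the numeric user ID it runs as
  singularity exec "$sif_file" id -u
else
  echo "singularity is not installed; nothing to do" >&2
fi
```

The id -u command should report the same user ID inside the container as outside it, whereas a Docker container runs its processes as root unless the image or the docker run options say otherwise.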

Assignment

  • Dockerize your own software
  • Either choose your Honours project or ask Peter for software
  • Write a Dockerfile and build the image
  • Use EXPOSE for ports and VOLUME for volumes
  • Publish the image on Docker Hub or Quay.io
  • Convert your Docker image for use with Singularity
  • or explain why Singularity cannot be used for your project
  • Send a README describing the software and how to run it to Peter <pvh@sanbi.ac.za>
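
As a possible starting point for the assignment, the sketch below writes a minimal Dockerfile that uses EXPOSE and VOLUME. Everything in it is an assumption made for illustration: the python:3-slim base image, port 8000, the /data mount point, and Python's built-in http.server standing in for your own software.

```shell
# Create a scratch build directory containing an example Dockerfile.
workdir=$(mktemp -d)

cat > "$workdir/Dockerfile" <<'EOF'
# Hypothetical example: serve files from /data over HTTP
FROM python:3-slim

# Document the port the service listens on
EXPOSE 8000

# Declare /data as a mount point for externally supplied data
VOLUME /data
WORKDIR /data

CMD ["python", "-m", "http.server", "8000"]
EOF

echo "Dockerfile written to $workdir"
```

The image could then be built with docker build -t example/dataserver "$workdir" and run with docker run --rm -p 8000:8000 -v "$PWD":/data example/dataserver, where example/dataserver is a made-up tag. Note that EXPOSE only documents the port; the -p option is what actually publishes it on the host.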