Python: Optimization Facilities

22 Mar 2024 - Houston

Contents
  1. Optimization Principles
  2. Tools
    1. Profiling
      1. strace / dtruss
    2. HTTP Server Stressing
      1. wrk
      2. Alternatives
  3. Caching
    1. memcached, Redis
    2. Bloom filter
      1. Motivation
      2. pybloom-live
  4. Distributing Work: Concurrency and Parallelism
    1. Limits to parallelism: Amdahl’s Law
    2. cPython’s Global Interpreter Lock (GIL)
    3. I/O-bound vs CPU-bound programs
      1. I/O-bound programs
      2. CPU-bound programs
    4. Usage / Rules of Thumb
      1. Threads
      2. Processes
  5. Examples
    1. ETL (Extract, Transform, Load)
    2. celery
  6. Writing Servers
    1. C10K Problem
    2. Async I/O
      1. Tools / Implementations

Optimization Principles

  • TBP: Target, Benchmark, Profile
  • Have target numbers in mind (e.g. SLAs that need be met)
  • Don’t optimize without benchmarking
  • Ensure benchmarks are taken against realistic data
  • Don’t optimize without profiling

Tools

Profiling

strace / dtruss

Profile system calls. dtrace is a Mac alternative to strace. dtruss wraps dtrace.

HTTP Server Stressing

wrk

Simple, provides useful baseline data about app performance

wrk --connections=3 --threads=3 --script=wrk-geo.lua https://localhost:8080/dist

Alternatives

  • Apache ab
  • vegeta
  • Locust

Caching

memcached, Redis

Bloom filter

Used to determine if an item is part of a set or not. A space-efficient, probabilistic data structure. Provides a “definitely not” or “probably yes” response.

Motivation

Common problem in streaming applications: Handling of the same event more than once. “Exactly once” delivery is difficult to achieve. Most systems settle on “at least once” delivery.

pybloom-live

Provides a BloomFilter implementation

Distributing Work: Concurrency and Parallelism

Limits to parallelism: Amdahl’s Law

Formula for the theoretical limits of speed increase or latency as a result of the parallel computing.

\[ S(N) = \frac{1}{(1-P) + (P/N)} \]

cPython’s Global Interpreter Lock (GIL)

Implication is that a single Python process can only use a single core.

I/O-bound vs CPU-bound programs

I/O-bound programs

Performance-constrained by input/output, such as network or disk Not significantly impacted by GIL (why?)

CPU-bound programs

Performance-constrained by expensive computations. Significantly impacted by GIL (why?). Means we have to reach across the process boundary to overcome CPU constraints on performance and get good parallelism.

Usage / Rules of Thumb

Threads

  • Use when I/O-bound. Traditional drivers and networking.

Processes

  • Use when CPU-bound. Not a lot of comms between processes.

Examples

ETL (Extract, Transform, Load)

Service gets HTTP server logs, stores them in a database. Service extracts some fields from logs and inserts the lines into the database.

celery

Uses a job queue called a broker Supports Redis and RabbitMQ

Writing Servers

C10K Problem

Context-switching is expensive

Async I/O

Great for I/O-bound programs. Handle several tasks with one thread.

Tools / Implementations

  • asyncio module
  • Twisted
  • Tornado web server