network-chaos-tool 0.1.0__tar.gz
- network_chaos_tool-0.1.0/PKG-INFO +105 -0
- network_chaos_tool-0.1.0/README.md +97 -0
- network_chaos_tool-0.1.0/injector/__init__.py +0 -0
- network_chaos_tool-0.1.0/injector/__main__.py +4 -0
- network_chaos_tool-0.1.0/injector/cli.py +61 -0
- network_chaos_tool-0.1.0/injector/docker_client.py +50 -0
- network_chaos_tool-0.1.0/injector/network_chaos.py +139 -0
- network_chaos_tool-0.1.0/injector/sidecar_runner.py +120 -0
- network_chaos_tool-0.1.0/network_chaos_tool.egg-info/PKG-INFO +105 -0
- network_chaos_tool-0.1.0/network_chaos_tool.egg-info/SOURCES.txt +14 -0
- network_chaos_tool-0.1.0/network_chaos_tool.egg-info/dependency_links.txt +1 -0
- network_chaos_tool-0.1.0/network_chaos_tool.egg-info/entry_points.txt +3 -0
- network_chaos_tool-0.1.0/network_chaos_tool.egg-info/requires.txt +1 -0
- network_chaos_tool-0.1.0/network_chaos_tool.egg-info/top_level.txt +1 -0
- network_chaos_tool-0.1.0/pyproject.toml +17 -0
- network_chaos_tool-0.1.0/setup.cfg +4 -0
@@ -0,0 +1,105 @@
+Metadata-Version: 2.4
+Name: network-chaos-tool
+Version: 0.1.0
+Summary: Programmable chaos engineering tool for Docker-based network topologies
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+Requires-Dist: docker>=7.0.0
+
+# Network Chaos Tool
+
+A programmable chaos engineering tool that targets Docker-based network topologies. It injects network faults such as latency and packet loss into running containers at runtime, using `tc` (traffic control) from a privileged sidecar, so the target containers need no special capabilities of their own.
+
+## Goal
+
+The end goal is a full-featured chaos engineering platform for Docker networks. It will allow users to define fault scenarios (e.g., "break OSPF adjacency," "add 200ms latency + 10% packet loss on an uplink") and run them on demand against containerized infrastructure. An observability stack will monitor and record how the network recovers.
+
+## What Has Been Done So Far (Phase 1)
+
+Phase 1 is a working proof of concept (PoC). It shows that the kernel-level network settings of a running Docker container can be changed from outside the container to simulate network faults.
+
+### Architecture
+
+- **One-shot privileged sidecar (`chaos-sidecar`)**: A Docker image containing the injector logic. It runs with `--privileged` and `--pid=host`, mounts the Docker socket, and enters the target container's network namespace using `nsenter` to run `tc` commands directly.
+- **Host-side CLI wrapper (`chaosctl`)**: A lightweight Python script that hides the `docker run` boilerplate. It builds the sidecar image automatically if it is missing and forwards your arguments to the sidecar.
+- **Effect stacking**: Latency and packet loss can be combined. Running `--action latency` then `--action loss` produces a single composite `tc` rule (`delay 500ms loss 20%`) instead of overwriting the previous effect.
+- **Idempotent recovery**: The `--action clear` command removes all `tc` rules on `eth0` and restores normal network behavior. Running it when no rules exist is not an error.
+- **Zero victim requirements**: Target containers need **no extra capabilities**, no `iproute2`, and no pre-configuration. Any running Docker container with an `eth0` interface can be a target.
+
+### Supported Faults
+
+| Action    | Description                                                |
+| --------- | ---------------------------------------------------------- |
+| `latency` | Add a fixed delay (ms) to all outgoing traffic.            |
+| `loss`    | Add a random packet loss (%) to all outgoing traffic.      |
+| `clear`   | Remove all `tc` rules and restore normal network behavior. |
+
+## Prerequisites
+
+- Docker Engine running locally.
+- Python 3.10+ with [uv](https://github.com/astral-sh/uv) installed.
+- A Linux host (or VM) where `nsenter` and `tc` are available.
+
+## Quick Start
+
+### 1. Build the victim test container
+
+```bash
+docker build -t chaos-victim tests/victim
+docker run -d --name victim chaos-victim
+```
+
+### 2. Verify baseline connectivity
+
+```bash
+docker exec victim ping -c 4 8.8.8.8
+```
+
+### 3. Inject chaos via `chaosctl`
+
+```bash
+# Add 500ms latency
+uv run chaosctl --target victim --action latency --value 500
+
+# Verify the effect
+docker exec victim ping -c 4 8.8.8.8
+
+# Stack 20% packet loss on top
+uv run chaosctl --target victim --action loss --value 20
+
+# Verify both effects
+docker exec victim ping -c 20 8.8.8.8
+
+# Recover
+uv run chaosctl --target victim --action clear
+
+# Verify recovery
+docker exec victim ping -c 4 8.8.8.8
+```
+
+### 4. Force rebuild the sidecar (optional)
+
+If you modify the sidecar `Dockerfile` or the injector code:
+
+```bash
+uv run chaosctl --target victim --action clear --rebuild
+```
+
+## Project Structure
+
+```
+proj/
+├── pyproject.toml          # uv project config
+├── uv.lock                 # Locked dependency tree
+├── Dockerfile              # Sidecar image definition
+├── README.md               # This file
+├── injector/               # Core chaos logic
+│   ├── cli.py              # Runs inside sidecar (nsenter logic)
+│   ├── docker_client.py    # Resolves container name -> PID
+│   ├── network_chaos.py    # tc command builder with stacking
+│   ├── sidecar_runner.py   # Host wrapper (chaosctl entrypoint)
+│   └── __main__.py         # Entry point for `python -m injector`
+└── tests/
+    └── victim/
+        └── Dockerfile      # Minimal Alpine victim for testing
+```
@@ -0,0 +1,97 @@
+# Network Chaos Tool
+
+A programmable chaos engineering tool that targets Docker-based network topologies. It injects network faults such as latency and packet loss into running containers at runtime, using `tc` (traffic control) from a privileged sidecar, so the target containers need no special capabilities of their own.
+
+## Goal
+
+The end goal is a full-featured chaos engineering platform for Docker networks. It will allow users to define fault scenarios (e.g., "break OSPF adjacency," "add 200ms latency + 10% packet loss on an uplink") and run them on demand against containerized infrastructure. An observability stack will monitor and record how the network recovers.
+
+## What Has Been Done So Far (Phase 1)
+
+Phase 1 is a working proof of concept (PoC). It shows that the kernel-level network settings of a running Docker container can be changed from outside the container to simulate network faults.
+
+### Architecture
+
+- **One-shot privileged sidecar (`chaos-sidecar`)**: A Docker image containing the injector logic. It runs with `--privileged` and `--pid=host`, mounts the Docker socket, and enters the target container's network namespace using `nsenter` to run `tc` commands directly.
+- **Host-side CLI wrapper (`chaosctl`)**: A lightweight Python script that hides the `docker run` boilerplate. It builds the sidecar image automatically if it is missing and forwards your arguments to the sidecar.
+- **Effect stacking**: Latency and packet loss can be combined. Running `--action latency` then `--action loss` produces a single composite `tc` rule (`delay 500ms loss 20%`) instead of overwriting the previous effect.
+- **Idempotent recovery**: The `--action clear` command removes all `tc` rules on `eth0` and restores normal network behavior. Running it when no rules exist is not an error.
+- **Zero victim requirements**: Target containers need **no extra capabilities**, no `iproute2`, and no pre-configuration. Any running Docker container with an `eth0` interface can be a target.
+
+### Supported Faults
+
+| Action    | Description                                                |
+| --------- | ---------------------------------------------------------- |
+| `latency` | Add a fixed delay (ms) to all outgoing traffic.            |
+| `loss`    | Add a random packet loss (%) to all outgoing traffic.      |
+| `clear`   | Remove all `tc` rules and restore normal network behavior. |
+
+## Prerequisites
+
+- Docker Engine running locally.
+- Python 3.10+ with [uv](https://github.com/astral-sh/uv) installed.
+- A Linux host (or VM) where `nsenter` and `tc` are available.
+
+## Quick Start
+
+### 1. Build the victim test container
+
+```bash
+docker build -t chaos-victim tests/victim
+docker run -d --name victim chaos-victim
+```
+
+### 2. Verify baseline connectivity
+
+```bash
+docker exec victim ping -c 4 8.8.8.8
+```
+
+### 3. Inject chaos via `chaosctl`
+
+```bash
+# Add 500ms latency
+uv run chaosctl --target victim --action latency --value 500
+
+# Verify the effect
+docker exec victim ping -c 4 8.8.8.8
+
+# Stack 20% packet loss on top
+uv run chaosctl --target victim --action loss --value 20
+
+# Verify both effects
+docker exec victim ping -c 20 8.8.8.8
+
+# Recover
+uv run chaosctl --target victim --action clear
+
+# Verify recovery
+docker exec victim ping -c 4 8.8.8.8
+```
+
+### 4. Force rebuild the sidecar (optional)
+
+If you modify the sidecar `Dockerfile` or the injector code:
+
+```bash
+uv run chaosctl --target victim --action clear --rebuild
+```
+
+## Project Structure
+
+```
+proj/
+├── pyproject.toml          # uv project config
+├── uv.lock                 # Locked dependency tree
+├── Dockerfile              # Sidecar image definition
+├── README.md               # This file
+├── injector/               # Core chaos logic
+│   ├── cli.py              # Runs inside sidecar (nsenter logic)
+│   ├── docker_client.py    # Resolves container name -> PID
+│   ├── network_chaos.py    # tc command builder with stacking
+│   ├── sidecar_runner.py   # Host wrapper (chaosctl entrypoint)
+│   └── __main__.py         # Entry point for `python -m injector`
+└── tests/
+    └── victim/
+        └── Dockerfile      # Minimal Alpine victim for testing
+```
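The Quick Start in the README verifies each step by reading `ping` output by eye. A minimal sketch of checking the same thing programmatically follows. `parse_ping_summary` is a hypothetical helper, not part of the package, and the sample text uses made-up numbers in the style of a BusyBox `ping` summary; the exact wording of the summary depends on the `ping` implementation in the victim image.

```python
import re


def parse_ping_summary(output: str) -> dict:
    """Extract loss percentage and average RTT from a ping summary.

    Hypothetical helper: the patterns assume a '<n>% packet loss' phrase and a
    'min/avg/max[...] = a/b/c[...] ms' line, which common ping tools print.
    """
    summary = {}
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", output)
    if loss:
        summary["loss_percent"] = float(loss.group(1))
    # The second number after '=' is the average round-trip time.
    rtt = re.search(r"=\s*[\d.]+/([\d.]+)/", output)
    if rtt:
        summary["avg_rtt_ms"] = float(rtt.group(1))
    return summary


if __name__ == "__main__":
    # Illustrative sample only; real values vary from run to run.
    sample = (
        "--- 8.8.8.8 ping statistics ---\n"
        "20 packets transmitted, 16 packets received, 20% packet loss\n"
        "round-trip min/avg/max = 512.3/514.8/519.1 ms\n"
    )
    print(parse_ping_summary(sample))
```

Because `loss` is random, the measured percentage over 20 packets will only approximate the configured value, so any automated check should allow a tolerance.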
File without changes
@@ -0,0 +1,61 @@
+"""CLI entry point for the chaos injector."""
+
+import argparse
+import sys
+
+from injector.docker_client import get_container_pid
+from injector.network_chaos import add_latency, add_loss, clear_rules
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Inject network chaos into a Docker container."
+    )
+    parser.add_argument(
+        "--target",
+        "-t",
+        required=True,
+        help="Name or ID of the target container.",
+    )
+    parser.add_argument(
+        "--action",
+        "-a",
+        required=True,
+        choices=["latency", "loss", "clear"],
+        help="Chaos action to apply.",
+    )
+    parser.add_argument(
+        "--value",
+        "-v",
+        type=int,
+        help="Value for the action (ms for latency, percent for loss). Not needed for 'clear'.",
+    )
+
+    args = parser.parse_args()
+
+    if args.action in ("latency", "loss") and args.value is None:
+        parser.error(f"--value is required when action is '{args.action}'.")
+
+    try:
+        pid = get_container_pid(args.target)
+    except ValueError as exc:
+        print(f"Error: {exc}", file=sys.stderr)
+        sys.exit(1)
+
+    try:
+        if args.action == "latency":
+            add_latency(pid, args.value)
+            print(f"Added {args.value}ms latency to container '{args.target}'.")
+        elif args.action == "loss":
+            add_loss(pid, args.value)
+            print(f"Added {args.value}% packet loss to container '{args.target}'.")
+        elif args.action == "clear":
+            clear_rules(pid)
+            print(f"Cleared all tc rules from container '{args.target}'.")
+    except (RuntimeError, ValueError) as exc:  # add_latency/add_loss raise ValueError on out-of-range values
+        print(f"Error: {exc}", file=sys.stderr)
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
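In `injector/cli.py` the range checks for `--value` live in `add_latency` and `add_loss`, so an out-of-range value is only rejected after the container lookup has already run. One possible alternative is to validate up front. The sketch below assumes a hypothetical helper, `validate_value`, which is not part of the package; its limits and messages are copied from the checks in `network_chaos.py`.

```python
from typing import Optional


def validate_value(action: str, value: Optional[int]) -> None:
    """Raise ValueError if `value` is missing or out of range for `action`.

    Hypothetical helper; limits mirror add_latency() and add_loss().
    """
    if action == "clear":
        return  # 'clear' takes no value
    if value is None:
        raise ValueError(f"--value is required when action is '{action}'.")
    if action == "latency" and value < 0:
        raise ValueError("Latency must be non-negative.")
    if action == "loss" and not 0 <= value <= 100:
        raise ValueError("Loss percent must be between 0 and 100.")


if __name__ == "__main__":
    for action, value in [("latency", 500), ("loss", 120), ("clear", None)]:
        try:
            validate_value(action, value)
            print(f"{action} {value}: ok")
        except ValueError as exc:
            print(f"{action} {value}: rejected ({exc})")
```

If such a helper were called right after `parse_args()`, its message could be passed to `parser.error`, so that bad values are reported as usage errors before any Docker call is made.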
@@ -0,0 +1,50 @@
+"""Thin wrapper around the Docker SDK for container lookup."""
+
+import docker
+from docker.errors import NotFound
+
+
+def get_container(name_or_id: str):
+    """Find a running container by name or ID.
+
+    Args:
+        name_or_id: The Docker container name or short/long ID.
+
+    Returns:
+        A docker.models.containers.Container object.
+
+    Raises:
+        ValueError: If the container is not found or not running.
+    """
+    client = docker.from_env()
+
+    try:
+        container = client.containers.get(name_or_id)
+    except NotFound as exc:
+        raise ValueError(f"Container '{name_or_id}' not found.") from exc
+
+    if container.status != "running":
+        raise ValueError(
+            f"Container '{name_or_id}' is not running (status: {container.status})."
+        )
+
+    return container
+
+
+def get_container_pid(name_or_id: str) -> int:
+    """Find a running container and return its host PID.
+
+    Args:
+        name_or_id: The Docker container name or short/long ID.
+
+    Returns:
+        The host PID (int) of the container's init process.
+
+    Raises:
+        ValueError: If the container is not found or not running.
+    """
+    container = get_container(name_or_id)
+    pid = container.attrs.get("State", {}).get("Pid")
+    if not pid:
+        raise ValueError(f"Could not determine PID for container '{name_or_id}'.")
+    return pid
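`get_container_pid` reads the PID from the container's inspect data (`State.Pid`). Docker reports a PID of 0 for a container that is not running, which is why a simple falsy check covers both the missing and the stopped case. The sketch below isolates that step as a pure function so it can be exercised without a Docker daemon. `pid_from_attrs` and the sample dictionaries are illustrative assumptions, trimmed to the fields the code actually reads.

```python
def pid_from_attrs(attrs: dict, name: str = "<container>") -> int:
    """Return the host PID from docker-inspect style attributes.

    Hypothetical helper mirroring the lookup in get_container_pid().
    """
    pid = attrs.get("State", {}).get("Pid")
    if not pid:  # missing key, None, or 0 (container not running)
        raise ValueError(f"Could not determine PID for container '{name}'.")
    return int(pid)


if __name__ == "__main__":
    # Made-up sample data shaped like the relevant part of `docker inspect`.
    running = {"State": {"Status": "running", "Pid": 4242}}
    exited = {"State": {"Status": "exited", "Pid": 0}}

    print(pid_from_attrs(running, "victim"))
    try:
        pid_from_attrs(exited, "victim")
    except ValueError as exc:
        print(f"Error: {exc}")
```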
@@ -0,0 +1,139 @@
+"""Network chaos injection using tc (traffic control) via nsenter."""
+
+import re
+import subprocess
+
+
+def _exec_tc_in_netns(pid: int, command: str):
+    """Execute a tc command inside the target container's network namespace.
+
+    Args:
+        pid: Host PID of the target container's init process.
+        command: The tc sub-command (e.g. "qdisc add dev eth0 root netem delay 500ms").
+
+    Returns:
+        subprocess.CompletedProcess result.
+    """
+    full_cmd = ["nsenter", "-t", str(pid), "-n", "tc"] + command.split()
+    result = subprocess.run(
+        full_cmd,
+        capture_output=True,
+        text=True,
+    )
+    return result
+
+
+def _is_no_such_file_error(output: str) -> bool:
+    """Check if tc output indicates the qdisc does not exist."""
+    text = output.lower()
+    return (
+        "no such file" in text
+        or "cannot find device" in text
+        or "invalid argument" in text
+        or "cannot delete qdisc with handle of zero" in text
+    )
+
+
+def _get_current_netem_params(pid: int) -> dict:
+    """Inspect current netem parameters on eth0.
+
+    Returns:
+        A dict like {"delay": 500, "loss": 20} or {} if no netem qdisc exists.
+    """
+    result = _exec_tc_in_netns(pid, "qdisc show dev eth0")
+    if result.returncode != 0:
+        return {}
+
+    # Look for a line like:
+    #   qdisc netem 8002: root refcnt 2 limit 1000 delay 500ms loss 20%
+    match = re.search(r"qdisc\s+netem\s+\S+:\s+root.*", result.stdout)
+    if not match:
+        return {}
+
+    line = match.group(0)
+    params = {}
+    delay_match = re.search(r"delay\s+(\d+(?:\.\d+)?)(ms|s)\b", line)  # tc may print e.g. "500.0ms" or "1s"
+    if delay_match:
+        params["delay"] = round(float(delay_match.group(1)) * (1000 if delay_match.group(2) == "s" else 1))
+    loss_match = re.search(r"loss\s+(\d+(?:\.\d+)?)%", line)
+    if loss_match:
+        params["loss"] = float(loss_match.group(1))
+    return params
+
+
+def _build_netem_command(action: str, params: dict) -> str:
+    """Build a tc qdisc command string from action and params.
+
+    Args:
+        action: Either "add" or "change".
+        params: Dict with optional keys "delay" (int, ms) and "loss" (float, percent).
+
+    Returns:
+        Full tc sub-command string (e.g. "qdisc change dev eth0 root netem delay 500ms loss 20%").
+    """
+    parts = [f"qdisc {action} dev eth0 root netem"]
+    if "delay" in params:
+        parts.append(f"delay {int(params['delay'])}ms")
+    if "loss" in params:
+        parts.append(f"loss {float(params['loss'])}%")
+    return " ".join(parts)
+
+
+def clear_rules(pid: int):
+    """Remove all tc rules on eth0 inside the container's network namespace.
+
+    This is idempotent: if no rules exist, it does not raise.
+    """
+    result = _exec_tc_in_netns(pid, "qdisc del dev eth0 root")
+    if result.returncode != 0 and not _is_no_such_file_error(result.stderr):
+        raise RuntimeError(f"Failed to clear tc rules: {result.stderr.strip()}")
+
+
+def add_latency(pid: int, ms: int):
+    """Add latency to all outgoing traffic on eth0.
+
+    If a loss rule already exists, it is preserved and merged.
+
+    Args:
+        pid: Host PID of the target container's init process.
+        ms: Latency in milliseconds.
+    """
+    if ms < 0:
+        raise ValueError("Latency must be non-negative.")
+
+    current = _get_current_netem_params(pid)
+    current["delay"] = ms
+
+    action = "change" if _has_netem_qdisc(pid) else "add"
+    cmd = _build_netem_command(action, current)
+    result = _exec_tc_in_netns(pid, cmd)
+    if result.returncode != 0:
+        raise RuntimeError(f"Failed to add latency: {result.stderr.strip()}")
+
+
+def add_loss(pid: int, percent: int):
+    """Add packet loss to all outgoing traffic on eth0.
+
+    If a latency rule already exists, it is preserved and merged.
+
+    Args:
+        pid: Host PID of the target container's init process.
+        percent: Packet loss percentage (0-100).
+    """
+    if not 0 <= percent <= 100:
+        raise ValueError("Loss percent must be between 0 and 100.")
+
+    current = _get_current_netem_params(pid)
+    current["loss"] = percent
+
+    action = "change" if _has_netem_qdisc(pid) else "add"
+    cmd = _build_netem_command(action, current)
+    result = _exec_tc_in_netns(pid, cmd)
+    if result.returncode != 0:
+        raise RuntimeError(f"Failed to add loss: {result.stderr.strip()}")
+
+
+def _has_netem_qdisc(pid: int) -> bool:
+    """Check if a netem qdisc currently exists on eth0."""
+    result = _exec_tc_in_netns(pid, "qdisc show dev eth0")
+    return result.returncode == 0 and "netem" in result.stdout
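The stacking behaviour in `add_latency` and `add_loss` comes down to three steps: parse the current root `netem` line, overwrite one key, and emit either `add` or `change`. The sketch below expresses those steps as pure functions that can be tested without `nsenter` or a container. The helper names (`parse_netem`, `plan_command`) and the sample `tc` output are assumptions for illustration; the exact text that `tc qdisc show` prints can differ between iproute2 versions, and unlike the module above this sketch decides `add` versus `change` from the parsed root line rather than from a substring check.

```python
import re
from typing import Optional


def parse_netem(show_output: str) -> Optional[dict]:
    """Return netem params from `tc qdisc show` text, or None if no root netem."""
    match = re.search(r"qdisc\s+netem\s+\S+:\s+root.*", show_output)
    if not match:
        return None
    line = match.group(0)
    params = {}
    delay = re.search(r"delay\s+(\d+(?:\.\d+)?)(ms|s)\b", line)
    if delay:
        scale = 1000 if delay.group(2) == "s" else 1  # normalise to milliseconds
        params["delay"] = round(float(delay.group(1)) * scale)
    loss = re.search(r"loss\s+(\d+(?:\.\d+)?)%", line)
    if loss:
        params["loss"] = float(loss.group(1))
    return params


def plan_command(show_output: str, key: str, value: float) -> str:
    """Merge one new setting into the existing ones and build the tc sub-command."""
    existing = parse_netem(show_output)
    verb = "add" if existing is None else "change"
    params = dict(existing or {})
    params[key] = value  # the new setting wins; other keys are preserved
    parts = [f"qdisc {verb} dev eth0 root netem"]
    if "delay" in params:
        parts.append(f"delay {int(params['delay'])}ms")
    if "loss" in params:
        parts.append(f"loss {params['loss']:g}%")
    return " ".join(parts)


if __name__ == "__main__":
    # Illustrative `tc qdisc show dev eth0` outputs, before and after adding latency.
    before = "qdisc noqueue 0: root refcnt 2"
    after = "qdisc netem 8002: root refcnt 2 limit 1000 delay 500ms"
    print(plan_command(before, "delay", 500))
    # qdisc add dev eth0 root netem delay 500ms
    print(plan_command(after, "loss", 20))
    # qdisc change dev eth0 root netem delay 500ms loss 20%
```

Keeping the parsing and planning separate from the `subprocess` call makes the merge rule easy to unit-test, at the cost of one more layer between the CLI and `tc`.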
@@ -0,0 +1,120 @@
+"""Host-side wrapper that launches the chaos-sidecar container."""
+
+import argparse
+import subprocess
+import sys
+from pathlib import Path
+
+
+SIDECAR_IMAGE = "chaos-sidecar"
+
+
+def _ensure_image():
+    """Check if the sidecar image exists locally; auto-build if missing."""
+    result = subprocess.run(
+        ["docker", "images", "-q", SIDECAR_IMAGE],
+        capture_output=True,
+        text=True,
+    )
+    if result.returncode != 0:
+        raise RuntimeError("Docker does not seem to be running or is not installed.")
+
+    if not result.stdout.strip():
+        print(f"'{SIDECAR_IMAGE}' image not found locally. Building...")
+        _build_image()
+        print(f"'{SIDECAR_IMAGE}' built successfully.")
+
+
+def _build_image():
+    """Build the sidecar Docker image from the project root."""
+    project_root = Path(__file__).resolve().parent.parent
+    dockerfile = project_root / "Dockerfile"
+
+    if not dockerfile.exists():
+        raise RuntimeError(
+            f"Dockerfile not found at {dockerfile}. "
+            "Cannot auto-build the sidecar image."
+        )
+
+    subprocess.run(
+        ["docker", "build", "-t", SIDECAR_IMAGE, str(project_root)],
+        check=True,
+    )
+
+
+def _run_sidecar(args: argparse.Namespace):
+    """Launch the sidecar container with the requested chaos args."""
+    cmd = [
+        "docker",
+        "run",
+        "--rm",
+        "--privileged",
+        "--pid=host",
+        "-v",
+        "/var/run/docker.sock:/var/run/docker.sock",
+        SIDECAR_IMAGE,
+        "--target",
+        args.target,
+        "--action",
+        args.action,
+    ]
+
+    if args.value is not None:
+        cmd.extend(["--value", str(args.value)])
+
+    subprocess.run(cmd, check=True)
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Inject network chaos into a Docker container (host-side wrapper)."
+    )
+    parser.add_argument(
+        "--target",
+        "-t",
+        required=True,
+        help="Name or ID of the target container.",
+    )
+    parser.add_argument(
+        "--action",
+        "-a",
+        required=True,
+        choices=["latency", "loss", "clear"],
+        help="Chaos action to apply.",
+    )
+    parser.add_argument(
+        "--value",
+        "-v",
+        type=int,
+        help="Value for the action (ms for latency, percent for loss). Not needed for 'clear'.",
+    )
+    parser.add_argument(
+        "--rebuild",
+        action="store_true",
+        help="Force rebuild of the sidecar image before running.",
+    )
+
+    args = parser.parse_args()
+
+    if args.action in ("latency", "loss") and args.value is None:
+        parser.error(f"--value is required when action is '{args.action}'.")
+
+    try:
+        if args.rebuild:
+            _build_image()
+        else:
+            _ensure_image()
+
+        _run_sidecar(args)
+    except subprocess.CalledProcessError as exc:
+        print(
+            f"Error: Command failed with exit code {exc.returncode}.", file=sys.stderr
+        )
+        sys.exit(1)
+    except (RuntimeError, FileNotFoundError) as exc:  # FileNotFoundError: the docker binary is not on PATH
+        print(f"Error: {exc}", file=sys.stderr)
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
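When debugging, it can help to see the exact `docker run` invocation that `chaosctl` issues. The sketch below rebuilds the same argument list as `_run_sidecar` in a side-effect-free function and prints it in shell form. `build_sidecar_command` is a hypothetical helper, not part of the package; the flags and image name are taken from the module above.

```python
import shlex
from typing import List, Optional


def build_sidecar_command(
    target: str,
    action: str,
    value: Optional[int] = None,
    image: str = "chaos-sidecar",
) -> List[str]:
    """Return the docker argv used to launch the sidecar (nothing is executed)."""
    cmd = [
        "docker", "run", "--rm",
        "--privileged",          # needed to enter another container's namespace
        "--pid=host",            # so the target's host PID is visible to nsenter
        "-v", "/var/run/docker.sock:/var/run/docker.sock",
        image,
        "--target", target,
        "--action", action,
    ]
    if value is not None:
        cmd.extend(["--value", str(value)])
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, e.g. for a dry-run mode.
    print(shlex.join(build_sidecar_command("victim", "latency", 500)))
```

Separating command construction from `subprocess.run` would also make a dry-run option straightforward to add, though the package does not currently offer one.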
@@ -0,0 +1,105 @@
+Metadata-Version: 2.4
+Name: network-chaos-tool
+Version: 0.1.0
+Summary: Programmable chaos engineering tool for Docker-based network topologies
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+Requires-Dist: docker>=7.0.0
+
+# Network Chaos Tool
+
+A programmable chaos engineering tool that targets Docker-based network topologies. It injects network faults such as latency and packet loss into running containers at runtime, using `tc` (traffic control) from a privileged sidecar, so the target containers need no special capabilities of their own.
+
+## Goal
+
+The end goal is a full-featured chaos engineering platform for Docker networks. It will allow users to define fault scenarios (e.g., "break OSPF adjacency," "add 200ms latency + 10% packet loss on an uplink") and run them on demand against containerized infrastructure. An observability stack will monitor and record how the network recovers.
+
+## What Has Been Done So Far (Phase 1)
+
+Phase 1 is a working proof of concept (PoC). It shows that the kernel-level network settings of a running Docker container can be changed from outside the container to simulate network faults.
+
+### Architecture
+
+- **One-shot privileged sidecar (`chaos-sidecar`)**: A Docker image containing the injector logic. It runs with `--privileged` and `--pid=host`, mounts the Docker socket, and enters the target container's network namespace using `nsenter` to run `tc` commands directly.
+- **Host-side CLI wrapper (`chaosctl`)**: A lightweight Python script that hides the `docker run` boilerplate. It builds the sidecar image automatically if it is missing and forwards your arguments to the sidecar.
+- **Effect stacking**: Latency and packet loss can be combined. Running `--action latency` then `--action loss` produces a single composite `tc` rule (`delay 500ms loss 20%`) instead of overwriting the previous effect.
+- **Idempotent recovery**: The `--action clear` command removes all `tc` rules on `eth0` and restores normal network behavior. Running it when no rules exist is not an error.
+- **Zero victim requirements**: Target containers need **no extra capabilities**, no `iproute2`, and no pre-configuration. Any running Docker container with an `eth0` interface can be a target.
+
+### Supported Faults
+
+| Action    | Description                                                |
+| --------- | ---------------------------------------------------------- |
+| `latency` | Add a fixed delay (ms) to all outgoing traffic.            |
+| `loss`    | Add a random packet loss (%) to all outgoing traffic.      |
+| `clear`   | Remove all `tc` rules and restore normal network behavior. |
+
+## Prerequisites
+
+- Docker Engine running locally.
+- Python 3.10+ with [uv](https://github.com/astral-sh/uv) installed.
+- A Linux host (or VM) where `nsenter` and `tc` are available.
+
+## Quick Start
+
+### 1. Build the victim test container
+
+```bash
+docker build -t chaos-victim tests/victim
+docker run -d --name victim chaos-victim
+```
+
+### 2. Verify baseline connectivity
+
+```bash
+docker exec victim ping -c 4 8.8.8.8
+```
+
+### 3. Inject chaos via `chaosctl`
+
+```bash
+# Add 500ms latency
+uv run chaosctl --target victim --action latency --value 500
+
+# Verify the effect
+docker exec victim ping -c 4 8.8.8.8
+
+# Stack 20% packet loss on top
+uv run chaosctl --target victim --action loss --value 20
+
+# Verify both effects
+docker exec victim ping -c 20 8.8.8.8
+
+# Recover
+uv run chaosctl --target victim --action clear
+
+# Verify recovery
+docker exec victim ping -c 4 8.8.8.8
+```
+
+### 4. Force rebuild the sidecar (optional)
+
+If you modify the sidecar `Dockerfile` or the injector code:
+
+```bash
+uv run chaosctl --target victim --action clear --rebuild
+```
+
+## Project Structure
+
+```
+proj/
+├── pyproject.toml          # uv project config
+├── uv.lock                 # Locked dependency tree
+├── Dockerfile              # Sidecar image definition
+├── README.md               # This file
+├── injector/               # Core chaos logic
+│   ├── cli.py              # Runs inside sidecar (nsenter logic)
+│   ├── docker_client.py    # Resolves container name -> PID
+│   ├── network_chaos.py    # tc command builder with stacking
+│   ├── sidecar_runner.py   # Host wrapper (chaosctl entrypoint)
+│   └── __main__.py         # Entry point for `python -m injector`
+└── tests/
+    └── victim/
+        └── Dockerfile      # Minimal Alpine victim for testing
+```
@@ -0,0 +1,14 @@
+README.md
+pyproject.toml
+injector/__init__.py
+injector/__main__.py
+injector/cli.py
+injector/docker_client.py
+injector/network_chaos.py
+injector/sidecar_runner.py
+network_chaos_tool.egg-info/PKG-INFO
+network_chaos_tool.egg-info/SOURCES.txt
+network_chaos_tool.egg-info/dependency_links.txt
+network_chaos_tool.egg-info/entry_points.txt
+network_chaos_tool.egg-info/requires.txt
+network_chaos_tool.egg-info/top_level.txt
@@ -0,0 +1 @@
+
@@ -0,0 +1 @@
+docker>=7.0.0
@@ -0,0 +1 @@
+injector
@@ -0,0 +1,17 @@
+[project]
+name = "network-chaos-tool"
+version = "0.1.0"
+description = "Programmable chaos engineering tool for Docker-based network topologies"
+readme = "README.md"
+requires-python = ">=3.10"
+dependencies = [
+    "docker>=7.0.0",
+]
+
+[project.scripts]
+chaos = "injector.cli:main"
+chaosctl = "injector.sidecar_runner:main"
+
+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"