kubernetes 0.0.1 → 0.0.2
- checksums.yaml +4 -4
- data/Makefile +53 -0
- data/docs/access.md +252 -0
- data/docs/architecture.dia +0 -0
- data/docs/architecture.svg +523 -0
- data/docs/client-libraries.md +11 -0
- data/docs/flaky-tests.md +52 -0
- data/docs/identifiers.md +90 -0
- data/docs/networking.md +108 -0
- data/docs/ovs-networking.md +14 -0
- data/docs/ovs-networking.png +0 -0
- data/docs/pods.md +23 -0
- data/docs/releasing.dot +113 -0
- data/docs/releasing.md +152 -0
- data/docs/releasing.png +0 -0
- data/docs/resources.md +218 -0
- data/docs/roadmap.md +65 -0
- data/docs/salt.md +85 -0
- data/docs/security.md +26 -0
- metadata +22 -4
data/docs/client-libraries.md
ADDED
@@ -0,0 +1,11 @@
## kubernetes API client libraries

### Supported
* [Go](https://github.com/GoogleCloudPlatform/kubernetes/tree/master/pkg/client)

### User Contributed
*Note: Libraries provided by outside parties are supported by their authors, not the core Kubernetes team*

* [Java](https://github.com/nirmal070125/KubernetesAPIJavaClient)
* [Ruby](https://github.com/Ch00k/kuber)
data/docs/flaky-tests.md
ADDED
@@ -0,0 +1,52 @@
# Hunting flaky tests in Kubernetes

Sometimes unit tests are flaky. This means that due to (usually) race conditions, they will occasionally fail, even though most of the time they pass.

We have a goal of 99.9% flake-free tests, i.e. a test should fail spuriously no more than once in a thousand runs.

Running a test 1000 times on your own machine can be tedious and time consuming. Fortunately, there is a better way to achieve this using Kubernetes.

_Note: these instructions are mildly hacky for now; as we get run-once semantics and logging they will get better._

There is a testing image ```brendanburns/flake``` up on the docker hub. We will use this image to test our fix.

Create a replication controller with the following config:
```yaml
id: flakeController
desiredState:
  replicas: 24
  replicaSelector:
    name: flake
  podTemplate:
    desiredState:
      manifest:
        version: v1beta1
        id: ""
        volumes: []
        containers:
        - name: flake
          image: brendanburns/flake
          env:
          - name: TEST_PACKAGE
            value: pkg/tools
          - name: REPO_SPEC
            value: https://github.com/GoogleCloudPlatform/kubernetes
        restartpolicy: {}
    labels:
      name: flake
labels:
  name: flake
```

```./cluster/kubecfg.sh -c controller.yaml create replicaControllers```

This will spin up 24 instances of the test. They will run to completion, then exit; the kubelet will restart them; and eventually you will have sufficient runs for your purposes, at which point you can stop the replication controller:

```sh
./cluster/kubecfg.sh stop flakeController
./cluster/kubecfg.sh rm flakeController
```

Now examine the machines with ```docker ps -a``` and look for tasks that exited with non-zero exit codes (ignore those that exited -1, since that's what happens when you stop the replication controller).
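To tally the failures without reading the whole listing by hand, something along these lines works (a rough sketch; it just greps the status column of `docker ps -a`):

```sh
# Count containers from the flake image that exited with a code other than 0 or -1.
docker ps -a | grep brendanburns/flake | grep 'Exited (' \
  | grep -v 'Exited (0)' | grep -v 'Exited (-1)' | wc -l
```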
Happy flake hunting!
data/docs/identifiers.md
ADDED
@@ -0,0 +1,90 @@
# Identifiers and Names in Kubernetes

A summary of the goals and recommendations for identifiers in Kubernetes. Described in [GitHub issue #199](https://github.com/GoogleCloudPlatform/kubernetes/issues/199).


## Definitions

UID
: A non-empty, opaque, system-generated value guaranteed to be unique in time and space; intended to distinguish between historical occurrences of similar entities.

Name
: A non-empty string guaranteed to be unique within a given scope at a particular time; used in resource URLs; provided by clients at creation time and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish distinct entities, and reference particular entities across operations.

[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) label (DNS_LABEL)
: An alphanumeric (a-z, A-Z, and 0-9) string, with a maximum length of 63 characters, with the '-' character allowed anywhere except the first or last character, suitable for use as a hostname or segment in a domain name.

[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) subdomain (DNS_SUBDOMAIN)
: One or more rfc1035/rfc1123 labels separated by '.' with a maximum length of 253 characters.

[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) universally unique identifier (UUID)
: A 128 bit generated value that is extremely unlikely to collide across time and space and requires no central coordination.
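As a quick illustration of the DNS_LABEL rules, a check amounts to something like the following (a bash sketch only, not the validation the apiserver actually performs):

```sh
# Accepts an alphanumeric string of at most 63 characters in which '-' may
# appear anywhere except the first or last position.
is_dns_label() {
  [[ ${#1} -le 63 && $1 =~ ^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$ ]]
}

is_dns_label "backend-x4eb1" && echo "valid"
is_dns_label "-not-a-label"  || echo "invalid"
```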
## Objectives for names and UIDs

1. Uniquely identify (via a UID) an object across space and time

2. Uniquely name (via a name) an object across space

3. Provide human-friendly names in API operations and/or configuration files

4. Allow idempotent creation of API resources (#148) and enforcement of space-uniqueness of singleton objects

5. Allow DNS names to be automatically generated for some objects


## General design

1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must be specified. Name must be non-empty and unique within the apiserver. This enables idempotent and space-unique creation operations. Parts of the system (e.g. replication controller) may join strings (e.g. a base name and a random suffix) to create a unique Name. For situations where generating a name is impractical, some or all objects may support a param to auto-generate a name. Generating random names will defeat idempotency.
   * Examples: "guestbook.user", "backend-x4eb1"

2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN? format TBD via #1114) may be specified. Depending on the API receiver, namespaces might be validated (e.g. apiserver might ensure that the namespace actually exists). If a namespace is not specified, one will be assigned by the API receiver. This assignment policy might vary across API receivers (e.g. apiserver might have a default, kubelet might generate something semi-random).
   * Example: "api.k8s.example.com"

3. Upon acceptance of an object via an API, the object is assigned a UID (a UUID). UID must be non-empty and unique across space and time.
   * Example: "01234567-89ab-cdef-0123-456789abcdef"


## Case study: Scheduling a pod

Pods can be placed onto a particular node in a number of ways. This case study demonstrates how the above design can be applied to satisfy the objectives.

### A pod scheduled by a user through the apiserver

1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver.

2. The apiserver validates the input.
   1. A default Namespace is assigned.
   2. The pod name must be space-unique within the Namespace.
   3. Each container within the pod has a name which must be space-unique within the pod.

3. The pod is accepted.
   1. A new UID is assigned.

4. The pod is bound to a node.
   1. The kubelet on the node is passed the pod's UID, Namespace, and Name.

5. Kubelet validates the input.

6. Kubelet runs the pod.
   1. Each container is started up with enough metadata to distinguish the pod from whence it came.
   2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
      * This may correspond to Docker's container ID.

### A pod placed by a config file on the node

1. A config file is stored on the node, containing a pod with UID="", Namespace="", and Name="cadvisor".

2. Kubelet validates the input.
   1. Since UID is not provided, kubelet generates one.
   2. Since Namespace is not provided, kubelet generates one.
      1. The generated namespace should be deterministic and cluster-unique for the source, such as a hash of the hostname and file path.
         * E.g. Namespace="file-f4231812554558a718a01ca942782d81"

3. Kubelet runs the pod.
   1. Each container is started up with enough metadata to distinguish the pod from whence it came.
   2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
      1. This may correspond to Docker's container ID.
data/docs/networking.md
ADDED
@@ -0,0 +1,108 @@
# Networking

## Model and motivation

Kubernetes deviates from the default Docker networking model. The goal is for each [pod](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/pods.md) to have an IP in a flat shared networking namespace that has full communication with other physical computers and containers across the network. IP-per-pod creates a clean, backward-compatible model where pods can be treated much like VMs or physical hosts from the perspectives of port allocation, networking, naming, service discovery, load balancing, application configuration, and migration.

On the other hand, dynamic port allocation requires supporting both static ports (e.g., for externally accessible services) and dynamically allocated ports, requires partitioning centrally allocated and locally acquired dynamic ports, complicates scheduling (since ports are a scarce resource), is inconvenient for users, complicates application configuration, is plagued by port conflicts, reuse, and exhaustion, requires non-standard approaches to naming (e.g., etcd rather than DNS), requires proxies and/or redirection for programs using standard naming/addressing mechanisms (e.g., web browsers), requires watching and cache invalidation for address/port changes for instances in addition to watching group membership changes, and obstructs container/pod migration (e.g., using CRIU). NAT introduces additional complexity by fragmenting the addressing space, which breaks self-registration mechanisms, among other problems.

With the IP-per-pod model, all user containers within a pod behave as if they are on the same host with regard to networking. They can all reach each other’s ports on localhost. Ports which are published to the host interface are done so in the normal Docker way. All containers in all pods can talk to all other containers in all other pods by their 10-dot addresses.

In addition to avoiding the aforementioned problems with dynamic port allocation, this approach reduces friction for applications moving from the world of uncontainerized apps on physical or virtual hosts to containers within pods. People running application stacks together on the same host have already figured out how to make ports not conflict (e.g., by configuring them through environment variables) and have arranged for clients to find them.

The approach does reduce isolation between containers within a pod -- ports could conflict, and there couldn't be private ports across containers within a pod, but applications requiring their own port spaces could just run as separate pods, and processes requiring private communication could run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control what containers belong to the same pod whereas, in general, they don't control what pods land together on a host.

When any container calls SIOCGIFADDR, it sees the IP that any peer container would see them coming from -- each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. (We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to use communication through volumes (e.g., tmpfs) or IPC.

This is different from the standard Docker model. In that mode, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container, the peer would see the connection coming from a different IP than the container itself knows. In short -- you can never self-register anything from a container, because a container cannot be reached on its private IP.

An alternative we considered was an additional layer of addressing: pod-centric IP per container. Each container would have its own local IP address, visible only within that pod. This would perhaps make it easier for containerized applications to move from physical/virtual hosts to pods, but would be more complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) and to reason about, due to the additional layer of address translation, and would break self-registration and IP distribution mechanisms.

## Current implementation

For the Google Compute Engine cluster configuration scripts, [advanced routing](https://developers.google.com/compute/docs/networking#routing) is set up so that each VM has an extra 256 IP addresses that get routed to it. This is in addition to the 'main' IP address assigned to the VM that is NAT-ed for Internet access. The networking bridge (called `cbr0` to differentiate it from `docker0`) is set up outside of Docker proper and only does NAT for egress network traffic that isn't aimed at the virtual network.

Ports mapped in from the 'main IP' (and hence the internet if the right firewall rules are set up) are proxied in user mode by Docker. In the future, this should be done with `iptables` by either the Kubelet or Docker: [Issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15).

We start Docker with:

    DOCKER_OPTS="--bridge cbr0 --iptables=false"

We set up this bridge on each node with SaltStack, in [container_bridge.py](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/cluster/saltbase/salt/_states/container_bridge.py):

    cbr0:
      container_bridge.ensure:
        - cidr: {{ grains['cbr-cidr'] }}
    ...
    grains:
      roles:
        - kubernetes-pool
      cbr-cidr: $MINION_IP_RANGE

We make these addresses routable in GCE:

    gcutil addroute ${MINION_NAMES[$i]} ${MINION_IP_RANGES[$i]} \
      --norespect_terminal_width \
      --project ${PROJECT} \
      --network ${NETWORK} \
      --next_hop_instance ${ZONE}/instances/${MINION_NAMES[$i]} &

The minion IP ranges are /24s in the 10-dot space.

GCE itself does not know anything about these IPs, though.

These are not externally routable, though, so containers that need to communicate with the outside world need to use host networking. An external IP that is set up to forward to the VM will only forward to the VM's primary IP (which is assigned to no pod). So we use docker's -p flag to map published ports to the main interface. This has the side effect of disallowing two pods from exposing the same port. (More discussion on this in [Issue #390](https://github.com/GoogleCloudPlatform/kubernetes/issues/390).)

We create a container to use for the pod network namespace -- a single loopback device and a single veth device. All the user's containers get their network namespaces from this pod networking container.

Docker allocates IP addresses from a bridge we create on each node, using its “container” networking mode:

1. Create a normal (in the networking sense) container which uses a minimal image and runs a command that blocks forever. This is not a user-defined container, and gets a special well-known name.
   - creates a new network namespace (netns) and loopback device
   - creates a new pair of veth devices and binds them to the netns
   - auto-assigns an IP from docker’s IP range

2. Create the user containers and specify the name of the network container as their “net” argument (a rough sketch of the equivalent docker commands follows this list). Docker finds the PID of the command running in the network container and attaches to the netns of that PID.
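For illustration, the two steps map onto plain docker commands roughly like this (a sketch, not what the Kubelet literally runs; the container and image names are made up):

```sh
# 1. The pod's network container: a minimal image whose command just blocks forever.
docker run -d --name pod-example-net kubernetes/pause
# 2. A user container that joins the network container's netns via --net.
docker run -d --net=container:pod-example-net nginx
```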
### Other networking implementation examples

With the primary aim of providing the IP-per-pod model, other implementations exist to serve the same purpose outside of GCE.
- [OpenVSwitch with GRE/VxLAN](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/ovs-networking.md)
- [Flannel](https://github.com/coreos/flannel#flannel)

## Challenges and future work

### Docker API

Right now, docker inspect doesn't show the networking configuration of the containers, since they derive it from another container. That information should be exposed somehow.

### External IP assignment

We want to be able to assign IP addresses externally from Docker ([Docker issue #6743](https://github.com/dotcloud/docker/issues/6743)) so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across network container restarts ([Docker issue #2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate pod migration. Right now, if the network container dies, all the user containers must be stopped and restarted because the netns of the network container will change on restart, and any subsequent user container restart will join that new netns, thereby not being able to see its peers. Additionally, a change in IP address would encounter DNS caching/TTL problems. External IP assignment would also simplify DNS support (see below).

### Naming, discovery, and load balancing

In addition to enabling self-registration with 3rd-party discovery mechanisms, we'd like to set up DDNS automatically ([Issue #146](https://github.com/GoogleCloudPlatform/kubernetes/issues/146)). hostname, $HOSTNAME, etc. should return a name for the pod ([Issue #298](https://github.com/GoogleCloudPlatform/kubernetes/issues/298)), and gethostbyname should be able to resolve names of other pods. Probably we need to set up a DNS resolver to do the latter ([Docker issue #2267](https://github.com/dotcloud/docker/issues/2267)), so that we don't need to keep /etc/hosts files up to date dynamically.

Service endpoints are currently found through [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) environment variables specifying ports opened by the service proxy. We don't actually use [the Docker ambassador pattern](https://docs.docker.com/articles/ambassador_pattern_linking/) to link containers because we don't require applications to identify all clients at configuration time. Regardless, we're considering moving away from the current approach to an approach more akin to our approach for individual pods: allocate an IP address per service and automatically register the service in DDNS -- L3 load balancing, essentially. Using a flat service namespace doesn't scale, and environment variables don't permit dynamic updates, which complicates service deployment by imposing implicit ordering constraints.

We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), non-load-balanced services ([Issue #260](https://github.com/GoogleCloudPlatform/kubernetes/issues/260)), and other types of groups (worker pools, etc.). Providing the ability to Watch a label selector applied to pod addresses would enable efficient monitoring of group membership, which could be directly consumed or synced with a discovery mechanism. Event hooks ([Issue #140](https://github.com/GoogleCloudPlatform/kubernetes/issues/140)) for join/leave events would probably make this even easier.

### External routability

We want traffic between containers to use the pod IP addresses across nodes. Say we have Node A with a container IP space of 10.244.1.0/24 and Node B with a container IP space of 10.244.2.0/24. And we have Container A1 at 10.244.1.1 and Container B1 at 10.244.2.1. We want Container A1 to talk to Container B1 directly with no NAT. B1 should see the "source" in the IP packets of 10.244.1.1 -- not the "primary" host IP for Node A. That means that we want to turn off NAT for traffic between containers (and also between VMs and containers).

We'd also like to make pods directly routable from the external internet. However, we can't yet support the extra container IPs that we've provisioned talking to the internet directly. So, we don't map external IPs to the container IPs. Instead, we solve that problem by having traffic that isn't to the internal network (! 10.0.0.0/8) get NATed through the primary host IP address so that it can get 1:1 NATed by the GCE networking when talking to the internet. Similarly, incoming traffic from the internet has to get NATed/proxied through the host IP.

So we end up with 3 cases:

1. Container -> Container or Container <-> VM. These should use 10. addresses directly and there should be no NAT.

2. Container -> Internet. These have to get mapped to the primary host IP so that GCE knows how to egress that traffic. There are actually 2 layers of NAT here: Container IP -> Internal Host IP -> External Host IP. The first level happens in the guest with iptables and the second happens as part of GCE networking. The first one (Container IP -> internal host IP) does dynamic port allocation while the second maps ports 1:1. (A sketch of the guest-side masquerade rule follows this list.)

3. Internet -> Container. This also has to go through the primary host IP and also has 2 levels of NAT, ideally. However, the path currently is a proxy with (External Host IP -> Internal Host IP -> Docker) -> (Docker -> Container IP). Once [issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15) is closed, it should be External Host IP -> Internal Host IP -> Container IP. But to get that second arrow we have to set up the port forwarding iptables rules per mapped port.
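As an illustration of case 2's first NAT layer, the guest-side rule amounts to something like this (a sketch only; the real rules are installed by the cluster configuration scripts, and the interface name is an assumption):

```sh
# Masquerade container traffic bound for anything outside the internal
# 10.0.0.0/8 network behind the node's primary (internal) IP.
iptables -t nat -A POSTROUTING ! -d 10.0.0.0/8 -o eth0 -j MASQUERADE
```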
Another approach could be to create a new host interface alias for each pod, if we had a way to route an external IP to it. This would eliminate the scheduling constraints resulting from using the host's IP address.

### IPv6

IPv6 would be a nice option, also, but we can't depend on it yet. Docker support is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974), [Docker issue #6923](https://github.com/dotcloud/docker/issues/6923), [Docker issue #6975](https://github.com/dotcloud/docker/issues/6975). Additionally, direct IPv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-)
data/docs/ovs-networking.md
ADDED
@@ -0,0 +1,14 @@
# Kubernetes OpenVSwitch GRE/VxLAN networking

This document describes how OpenVSwitch is used to set up networking between pods across minions.
The tunnel type could be GRE or VxLAN. VxLAN is preferable when large-scale isolation needs to be performed within the network.

![ovs-networking](./ovs-networking.png "OVS Networking")

The vagrant setup in Kubernetes does the following:

The docker bridge is replaced with a brctl-generated Linux bridge (kbr0) with a 256-address subnet. Basically, a node gets a 10.244.x.0/24 subnet and docker is configured to use that bridge instead of the default docker0 bridge.

Also, an OVS bridge is created (obr0) and added as a port to the kbr0 bridge. All OVS bridges across all nodes are linked with GRE tunnels. So, each node has an outgoing GRE tunnel to all other nodes. It does not really need to be a complete mesh; the meshier the better. STP (spanning tree) mode is enabled in the bridges to prevent loops.

Routing rules enable any 10.244.0.0/16 target to become reachable via the OVS bridge connected with the tunnels.
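Concretely, on a node whose local pod subnet is, say, 10.244.1.0/24, the rule amounts to something like the following (illustrative only; the vagrant provisioning scripts set this up):

```sh
# Send traffic for the rest of the cluster's pod range to the local bridge,
# where the OVS GRE/VxLAN tunnels carry it to the right node.
ip route add 10.244.0.0/16 dev kbr0
```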
data/docs/pods.md
ADDED
@@ -0,0 +1,23 @@
# Pods

A _pod_ (as in a pod of whales or pea pod) is a relatively tightly coupled group of containers that are scheduled onto the same host. It models an application-specific "virtual host" in a containerized environment. Pods serve as units of scheduling, deployment, and horizontal scaling/replication, and share fate.

Why doesn't Kubernetes just support an affinity mechanism for co-scheduling containers instead? While pods have a number of benefits (e.g., simplifying the scheduler), the primary motivation is resource sharing.

In addition to defining the containers that run in the pod, the pod specifies a set of shared storage volumes. Pods facilitate data sharing and IPC among their constituents. In the future, they may share CPU and/or memory ([LPC2013](http://www.linuxplumbersconf.org/2013/ocw//system/presentations/1239/original/lmctfy%20(1).pdf)).

The containers in the pod also all use the same network namespace/IP (and port space). The goal is for each pod to have an IP address in a flat shared networking namespace that has full communication with other physical computers and containers across the network. [More details on networking](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/networking.md).

While pods can be used to host vertically integrated application stacks, their primary motivation is to support co-located, co-managed helper programs, such as:
- content management systems, file and data loaders, local cache managers, etc.
- log and checkpoint backup, compression, rotation, snapshotting, etc.
- data change watchers, log tailers, logging and monitoring adapters, event publishers, etc.
- proxies, bridges, and adapters
- controllers, managers, configurators, and updaters

Individual pods are not intended to run multiple instances of the same application, in general.

Why not just run multiple programs in a single Docker container?

1. Transparency. Making the containers within the pod visible to the infrastructure enables the infrastructure to provide services to those containers, such as process management and resource monitoring. This facilitates a number of conveniences for users.
2. Decoupling software dependencies. The individual containers may be rebuilt and redeployed independently. Kubernetes may even support live updates of individual containers someday.
data/docs/releasing.dot
ADDED
@@ -0,0 +1,113 @@
// Build it with:
// $ dot -Tsvg releasing.dot >releasing.svg

digraph tagged_release {
  size = "5,5"
  // Arrows go up.
  rankdir = BT
  subgraph left {
    // Group the left nodes together.
    ci012abc -> pr101 -> ci345cde -> pr102
    style = invis
  }
  subgraph right {
    // Group the right nodes together.
    version_commit -> dev_commit
    style = invis
  }
  { // Align the version commit and the info about it.
    rank = same
    // Align them with pr101
    pr101
    version_commit
    // release_info shows the change in the commit.
    release_info
  }
  { // Align the dev commit and the info about it.
    rank = same
    // Align them with 345cde
    ci345cde
    dev_commit
    dev_info
  }
  // Join the nodes from subgraph left.
  pr99 -> ci012abc
  pr102 -> pr100
  // Do the version node.
  pr99 -> version_commit
  dev_commit -> pr100
  tag -> version_commit
  pr99 [
    label = "Merge PR #99"
    shape = box
    fillcolor = "#ccccff"
    style = "filled"
    fontname = "Helvetica Neue, Helvetica, Segoe UI, Arial, freesans, sans-serif"
  ];
  ci012abc [
    label = "012abc"
    shape = circle
    fillcolor = "#ffffcc"
    style = "filled"
    fontname = "Consolas, Liberation Mono, Menlo, Courier, monospace"
  ];
  pr101 [
    label = "Merge PR #101"
    shape = box
    fillcolor = "#ccccff"
    style = "filled"
    fontname = "Helvetica Neue, Helvetica, Segoe UI, Arial, freesans, sans-serif"
  ];
  ci345cde [
    label = "345cde"
    shape = circle
    fillcolor = "#ffffcc"
    style = "filled"
    fontname = "Consolas, Liberation Mono, Menlo, Courier, monospace"
  ];
  pr102 [
    label = "Merge PR #102"
    shape = box
    fillcolor = "#ccccff"
    style = "filled"
    fontname = "Helvetica Neue, Helvetica, Segoe UI, Arial, freesans, sans-serif"
  ];
  version_commit [
    label = "678fed"
    shape = circle
    fillcolor = "#ccffcc"
    style = "filled"
    fontname = "Consolas, Liberation Mono, Menlo, Courier, monospace"
  ];
  dev_commit [
    label = "456dcb"
    shape = circle
    fillcolor = "#ffffcc"
    style = "filled"
    fontname = "Consolas, Liberation Mono, Menlo, Courier, monospace"
  ];
  pr100 [
    label = "Merge PR #100"
    shape = box
    fillcolor = "#ccccff"
    style = "filled"
    fontname = "Helvetica Neue, Helvetica, Segoe UI, Arial, freesans, sans-serif"
  ];
  release_info [
    label = "pkg/version/base.go:\ngitVersion = \"v0.5\";"
    shape = none
    fontname = "Helvetica Neue, Helvetica, Segoe UI, Arial, freesans, sans-serif"
  ];
  dev_info [
    label = "pkg/version/base.go:\ngitVersion = \"v0.5-dev\";"
    shape = none
    fontname = "Helvetica Neue, Helvetica, Segoe UI, Arial, freesans, sans-serif"
  ];
  tag [
    label = "$ git tag -a v0.5"
    fillcolor = "#ffcccc"
    style = "filled"
    fontname = "Helvetica Neue, Helvetica, Segoe UI, Arial, freesans, sans-serif"
  ];
}
data/docs/releasing.md
ADDED
@@ -0,0 +1,152 @@
# Releasing Kubernetes

This document explains how to create a Kubernetes release (as in version) and how the version information gets embedded into the built binaries.

## Origin of the Sources

Kubernetes may be built from either a git tree (using `hack/build-go.sh`) or from a tarball (using either `hack/build-go.sh` or `go install`) or directly by the Go native build system (using `go get`).

When building from git, we want to be able to insert specific information about the build tree at build time. In particular, we want to use the output of `git describe` to generate the version of Kubernetes and the status of the build tree (add a `-dirty` suffix if the tree was modified.)

When building from a tarball or using the Go build system, we will not have access to the information about the git tree, but we still want to be able to tell whether this build corresponds to an exact release (e.g. v0.3) or is between releases (e.g. at some point in development between v0.3 and v0.4).

## Version Number Format

In order to account for these use cases, there are some specific formats that may end up representing the Kubernetes version. Here are a few examples:

- **v0.5**: This is official version 0.5 and this version will only be used when building from a clean git tree at the v0.5 git tag, or from a tree extracted from the tarball corresponding to that specific release.
- **v0.5-15-g0123abcd4567**: This is the `git describe` output and it indicates that we are 15 commits past the v0.5 release and that the SHA1 of the commit where the binaries were built was `0123abcd4567`. It is only possible to have this level of detail in the version information when building from git, not when building from a tarball.
- **v0.5-15-g0123abcd4567-dirty** or **v0.5-dirty**: The extra `-dirty` suffix means that the tree had local modifications or untracked files at the time of the build, so there's no guarantee that the source code matches exactly the state of the tree at the `0123abcd4567` commit or at the `v0.5` git tag (respectively).
- **v0.5-dev**: This means we are building from a tarball or using `go get` or, if we have a git tree, we are using `go install` directly, so it is not possible to inject the git version into the build information. Additionally, this is not an official release, so the `-dev` suffix indicates that the version we are building is after `v0.5` but before `v0.6`. (There is actually an exception where a commit with `v0.5-dev` is not present on `v0.6`; see later for details.)
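When a git tree is available, the first three forms are simply what `git describe` reports; for example (the flags shown are illustrative, not necessarily the exact invocation used by the build script):

```sh
$ git describe --tags --dirty
v0.5-15-g0123abcd4567-dirty
```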
## Injecting Version into Binaries

In order to cover the different build cases, we start by providing information that can be used when using only Go build tools or when we do not have the git version information available.

To be able to provide a meaningful version in those cases, we set the contents of variables in a Go source file that will be used when no overrides are present.

We are using `pkg/version/base.go` as the source of versioning in the absence of information from git. Here is a sample of that file's contents:

```
var (
    gitVersion string = "v0.4-dev" // version from git, output of $(git describe)
    gitCommit string = "" // sha1 from git, output of $(git rev-parse HEAD)
)
```

This means a build with `go install` or `go get` or a build from a tarball will yield binaries that will identify themselves as `v0.4-dev` and will not be able to provide you with a SHA1.

To add the extra versioning information when building from git, the `hack/build-go.sh` script will gather that information (using `git describe` and `git rev-parse`) and then create a `-ldflags` string to pass to `go install` and tell the Go linker to override the contents of those variables at build time. It can, for instance, tell it to override `gitVersion` and set it to `v0.4-13-g4567bcdef6789-dirty` and set `gitCommit` to `4567bcdef6789...` which is the complete SHA1 of the (dirty) tree used at build time.
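Schematically, that override amounts to something like the following sketch (the package import path is inferred from `pkg/version/base.go`, and the flag syntax shown is the older space-separated form; newer Go toolchains spell it `-X name=value`):

```sh
# Gather the git information and hand it to the linker as variable overrides.
version=$(git describe --dirty)
commit=$(git rev-parse HEAD)
ldflags="-X github.com/GoogleCloudPlatform/kubernetes/pkg/version.gitVersion ${version}"
ldflags="${ldflags} -X github.com/GoogleCloudPlatform/kubernetes/pkg/version.gitCommit ${commit}"
go install -ldflags "${ldflags}" ./...
```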
## Handling Official Versions

Handling official versions from git is easy: as long as there is an annotated git tag pointing to a specific version, `git describe` will return that tag exactly, which will match the idea of an official version (e.g. `v0.5`).

Handling it on tarballs is a bit harder since the exact version string must be present in `pkg/version/base.go` for it to get embedded into the binaries. But simply creating a commit with `v0.5` on its own would mean that the commits coming after it would also get the `v0.5` version when built from tarball or `go get` while in fact they do not match `v0.5` (the one that was tagged) exactly.

To handle that case, creating a new release should involve creating two adjacent commits where the first of them will set the version to `v0.5` and the second will set it to `v0.5-dev`. In that case, even in the presence of merges, there will be a single commit where the exact `v0.5` version will be used and all others around it will either have `v0.4-dev` or `v0.5-dev`.

The diagram below illustrates it.

![Diagram of git commits involved in the release](releasing.png)

After working on `v0.4-dev` and merging PR 99 we decide it is time to release `v0.5`. So we start a new branch, create one commit to update `pkg/version/base.go` to include `gitVersion = "v0.5"` and `git commit` it.

We test it and make sure everything is working as expected.

Before sending a PR for it, we create a second commit on that same branch, updating `pkg/version/base.go` to include `gitVersion = "v0.5-dev"`. That will ensure that further builds (from tarball or `go install`) on that tree will always include the `-dev` suffix and will not have a `v0.5` version (since they do not match the official `v0.5` exactly.)

We then send PR 100 with both commits in it.
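In terms of raw git commands, the sequence looks roughly like this (a sketch; the branch name is made up and the edits to `pkg/version/base.go` are done by hand):

```sh
git checkout -b version-v0.5
# edit pkg/version/base.go: gitVersion = "v0.5"
git commit -am "Kubernetes version v0.5"
# edit pkg/version/base.go: gitVersion = "v0.5-dev"
git commit -am "Kubernetes version v0.5-dev"
```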
Once the PR is accepted, we can use `git tag -a` to create an annotated tag *pointing to the one commit* that has `v0.5` in `pkg/version/base.go` and push it to GitHub. (Unfortunately GitHub tags/releases are not annotated tags, so this needs to be done from a git client and pushed to GitHub using SSH.)
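For example, using the versioning commit from the diagram above (the remote URL is illustrative):

```sh
git tag -a v0.5 678fed -m "Kubernetes v0.5"
git push git@github.com:GoogleCloudPlatform/kubernetes.git v0.5
```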
## Parallel Commits

While we are working on releasing `v0.5`, other development takes place and other PRs get merged. For instance, in the example above, PRs 101 and 102 get merged to the master branch before the versioning PR gets merged.

This is not a problem; it is only slightly inaccurate: checking out the tree at commit `012abc`, at commit `345cde`, or at the merge commits of PR 101 or 102 will yield a version of `v0.4-dev`, *but* those commits are not present in `v0.5`.

In that sense, there is a small window in which commits will get a `v0.4-dev` or `v0.4-N-gXXX` label even though, while they are indeed later than `v0.4`, they are not really before `v0.5`, in that `v0.5` does not contain those commits.

Unfortunately, there is not much we can do about it. On the other hand, other projects seem to live with that and it does not really become a large problem.

As an example, Docker commit a327d9b91edf has a `v1.1.1-N-gXXX` label but it is not present in Docker `v1.2.0`:

```
$ git describe a327d9b91edf
v1.1.1-822-ga327d9b91edf

$ git log --oneline v1.2.0..a327d9b91edf
a327d9b91edf Fix data space reporting from Kb/Mb to KB/MB

(Non-empty output here means the commit is not present on v1.2.0.)
```