tensorflow-on-kubeflow

Set up Kubeflow first. The tf-operator module is the one that matters here; the other modules are not needed.

Start with training:

apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: mnist-simple-cpu-dist
spec:
  replicaSpecs:
    - replicas: 1 # 1 Master
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: tf-mnist-distributed
              name: tensorflow
              env:
                - name: TEST_TMPDIR
                  value: /training
              command: ["python", "/app/main.py"]
              volumeMounts:
                - name: kubeflow-dist-nas-mnist
                  mountPath: "/training"
          volumes:
            - name: kubeflow-dist-nas-mnist
              hostPath:
                path: /training
          restartPolicy: OnFailure
          nodeSelector:
            kubernetes.io/hostname: 10.8.64.200
    - replicas: 1 # 1 or 2 Workers depends on how many cpus you have
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: tf-mnist-distributed
              name: tensorflow
              env:
                - name: TEST_TMPDIR
                  value: /training
              command: ["python", "/app/main.py"]
              imagePullPolicy: Always
              volumeMounts:
                - name: kubeflow-dist-nas-mnist
                  mountPath: "/training"
          volumes:
            - name: kubeflow-dist-nas-mnist
              hostPath:
                path: /training
          restartPolicy: OnFailure
          nodeSelector:
            kubernetes.io/hostname: 10.8.64.200
    - replicas: 1 # 1 Parameter server
      tfReplicaType: PS
      template:
        spec:
          containers:
            - image: tf-mnist-distributed
              name: tensorflow
              command: ["python", "/app/main.py"]
              env:
                - name: TEST_TMPDIR
                  value: /training
              imagePullPolicy: Always
              volumeMounts:
                - name: kubeflow-dist-nas-mnist
                  mountPath: "/training"
          volumes:
            - name: kubeflow-dist-nas-mnist
              hostPath:
                path: /training
          restartPolicy: OnFailure
          nodeSelector:
            kubernetes.io/hostname: 10.8.64.200

After training finishes, run the export job:

apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: export-mnist-model
spec:
  replicaSpecs:
    - replicas: 1 # 1 Master
      template:
        spec:
          containers:
            - image: export-mnist-model
              name: tensorflow
              command: ["python", "/app/export_model.py"]
              args:
                - --model_version=1
                - --checkpoint_path=/training/tensorflow/logs/
                - /serving/mnist
              volumeMounts:
                - name: kubeflow-dist-nas-mnist
                  mountPath: "/training"
                - name: tf-serving-pvc
                  mountPath: "/serving"
          volumes:
            - name: kubeflow-dist-nas-mnist
              hostPath:
                path: /training
            - name: tf-serving-pvc
              hostPath:
                path: /serving
          restartPolicy: Never
          nodeSelector:
            kubernetes.io/hostname: 10.8.64.200

To view the training logs, run TensorBoard:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: tensorboard
  name: tensorboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorboard
  template:
    metadata:
      labels:
        app: tensorboard
    spec:
      volumes:
        - name: kubeflow-dist-nas-mnist
          hostPath:
            path: /training
      nodeSelector:
        kubernetes.io/hostname: 10.8.64.200
      containers:
        - name: tensorboard
          image: tensorflow/tensorflow:1.7.0
          imagePullPolicy: Always
          command:
            - /usr/local/bin/tensorboard
          args:
            - --logdir
            - /training/tensorflow/logs
          volumeMounts:
            - name: kubeflow-dist-nas-mnist
              mountPath: "/training"
          ports:
            - containerPort: 6006
              protocol: TCP
      dnsPolicy: ClusterFirst
      restartPolicy: Always
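
The Deployment above only declares containerPort 6006; nothing exposes it to other pods yet. A minimal sketch of a Service that could sit in front of it, assuming a plain ClusterIP Service is enough for your setup. The Service name and the port name are hypothetical; only the selector label and the port number come from the Deployment:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: tensorboard        # hypothetical name, not from the original notes
spec:
  selector:
    app: tensorboard       # matches the pod label in the Deployment above
  ports:
    - name: http           # hypothetical port name
      port: 6006
      targetPort: 6006     # the containerPort declared by the tensorboard container
      protocol: TCP
```

With no type set, the Service defaults to ClusterIP, so it is reachable only from inside the cluster.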

After the model is exported, run serving:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: mnist
  name: mnist-v1
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mnist
      version: v1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: mnist
        version: v1
    spec:
      containers:
        - args:
            - /usr/bin/tensorflow_model_server
            - --port=9000
            - --model_name=mnist
            - --model_base_path=/mnt/mnist
          image: tensorflow-serving-1.7
          imagePullPolicy: IfNotPresent
          name: mnist
          ports:
            - containerPort: 9000
              protocol: TCP
          resources:
            limits:
              cpu: "4"
              memory: 4Gi
            requests:
              cpu: "1"
              memory: 1Gi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /mnt
              name: nfs
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/hostname: 10.8.64.200
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - hostPath:
            path: /serving
            type: ""
          name: nfs
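
The serving Deployment mounts the host's /serving directory at /mnt, so --model_base_path=/mnt/mnist points at the same /serving/mnist directory the export job was given. Clients still need a stable address for port 9000. A minimal sketch of a matching Service, assuming in-cluster access is enough. The Service name and the port name are hypothetical; the selector labels, namespace, and port number come from the Deployment above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mnist              # hypothetical name, not from the original notes
  namespace: default
spec:
  selector:
    app: mnist             # both labels match the pod template above
    version: v1
  ports:
    - name: grpc           # hypothetical port name; --port=9000 is the model server's gRPC port
      port: 9000
      targetPort: 9000
      protocol: TCP
```

Selecting on both app and version pins the Service to this model version; dropping the version label from the selector would let one Service span several versioned Deployments.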