Deploying IBM Watson NLP to Kubernetes using KServe Modelmesh

In this blog, I will demonstrate how to deploy the Watson NLP library to Kubernetes using KServe ModelMesh.

For initial context, read my blog introducing IBM Watson for Embed.

For deployment to OpenShift, see this blog.

Introducing KServe

KServe is a standard model inference platform for Kubernetes. It is built for highly scalable use cases: it supports existing third-party model servers and standard ML/DL model formats, and it can be extended to support additional runtimes such as the Watson NLP runtime.

ModelMesh Serving is intended to further increase KServe's scalability, especially when there is a large number of models that change frequently. It intelligently loads and unloads models between memory and cloud object storage (COS), striking a trade-off between responsiveness to users and computational footprint.

Install KServe ModelMesh on Kubernetes

KServe ModelMesh requires etcd and S3-compatible storage, and optionally Knative and Istio.

Two approaches are available for installation. I took the quick start approach and installed to a Kubernetes cluster with the following commands:

RELEASE=release-0.9
git clone -b $RELEASE --depth 1 --single-branch https://github.com/kserve/modelmesh-serving.git
cd modelmesh-serving
kubectl create namespace modelmesh-serving
kubectl config set-context --current --namespace=modelmesh-serving
./scripts/install.sh --namespace modelmesh-serving --quickstart

After the script completes, you will find these pods running:

kubectl get pods

NAME                                    READY   STATUS    RESTARTS   AGE
etcd                                    1/1     Running   0          76m
minio                                   1/1     Running   0          76m
modelmesh-controller-77b8bf999c-2knhf   1/1     Running   0          75m

The installation also defines several serving runtimes by default:

kubectl get servingruntimes

NAME           DISABLED   MODELTYPE     CONTAINERS   AGE
mlserver-0.x              sklearn       mlserver     4m11s
ovms-1.x                  openvino_ir   ovms         4m11s
triton-2.x                keras         triton       4m11s

The quick start installation has also created a secret with credentials for the local MinIO object storage.

kubectl get secret/storage-config -n modelmesh-serving

NAME             TYPE     DATA   AGE
storage-config   Opaque   1      117m

The secret contains connection details for the “localMinIO” COS endpoint. This secret becomes important later when uploading the models to be served.
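
The endpoint and credentials can be inspected by decoding the secret; the localMinIO key below comes from the quick start defaults:

kubectl get secret storage-config -n modelmesh-serving -o jsonpath='{.data.localMinIO}' | base64 --decode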

Create a Pull Secret and ServiceAccount

Ensure you have a trial key.

IBM_ENTITLEMENT_KEY=<your trial key>
kubectl create secret docker-registry ibm-entitlement-key --docker-server=cp.icr.io/cp --docker-username=cp --docker-password=$IBM_ENTITLEMENT_KEY

An example ServiceAccount is provided. Create a ServiceAccount that references the pull secret.

git clone https://github.com/deleeuwblue/watson-embed-demos.git
kubectl apply -f watson-embed-demos/nlp/modelmesh-serving/serviceaccount.yaml
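
For reference, the applied serviceaccount.yaml presumably looks roughly like this (pull-secret-sa and ibm-entitlement-key are the names used throughout this walkthrough):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: pull-secret-sa
  namespace: modelmesh-serving
imagePullSecrets:
  - name: ibm-entitlement-key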

Configure ModelMesh Serving to use this ServiceAccount, giving the controller access to the IBM entitled registry. Using the Kubernetes console (Config and Storage → ConfigMaps), create or edit the ConfigMap model-serving-config in the modelmesh-serving namespace; the shipped model-serving-config-defaults ConfigMap documents the defaults that can be overridden.

Set serviceAccountName to pull-secret-sa. Also disable restProxy as this is not supported by Watson NLP:

apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    #Sample config overrides
    serviceAccountName: pull-secret-sa
    restProxy:
      enabled: false
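
If you prefer the command line to the console, the same override can be applied directly; this sketch is equivalent to the ConfigMap above:

kubectl apply -n modelmesh-serving -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    serviceAccountName: pull-secret-sa
    restProxy:
      enabled: false
EOF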

Restart the modelmesh-controller pod:

kubectl scale deployment/modelmesh-controller --replicas=0
kubectl scale deployment/modelmesh-controller --replicas=1
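
Alternatively, a rollout restart achieves the same result in one step:

kubectl rollout restart deployment/modelmesh-controller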

Patch all service accounts:

kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "ibm-entitlement-key"}]}'
kubectl patch serviceaccount modelmesh -p '{"imagePullSecrets": [{"name": "ibm-entitlement-key"}]}'
kubectl patch serviceaccount modelmesh-controller -p '{"imagePullSecrets": [{"name": "ibm-entitlement-key"}]}'
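
To verify a patch took effect, check that the ServiceAccount now lists the pull secret, for example:

kubectl get serviceaccount default -o jsonpath='{.imagePullSecrets}'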

Configure a ServingRuntime for Watson NLP

An example ServingRuntime resource is provided. The serving runtime specifies that the cp.icr.io/cp/ai/watson-nlp-runtime container image should be used to serve models that declare watson-nlp as their model format.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: watson-nlp-runtime
spec:
  containers:
  - env:
      - name: ACCEPT_LICENSE
        value: "true"
      - name: LOG_LEVEL
        value: info
      - name: CAPACITY
        value: "6000000000"
      - name: DEFAULT_MODEL_SIZE
        value: "500000000"
      - name: METRICS_PORT
        value: "2113"
    args:
      - --
      - python3
      - -m
      - watson_runtime.grpc_server
    image: cp.icr.io/cp/ai/watson-nlp-runtime:1.0.20
    imagePullPolicy: IfNotPresent
    name: watson-nlp-runtime
 #   resources:
 #     limits:
 #       cpu: 2
 #       memory: 8Gi
 #     requests:
 #       cpu: 1
 #       memory: 8Gi
  grpcDataEndpoint: port:8085
  grpcEndpoint: port:8085
  multiModel: true
  storageHelper:
    disabled: false
  supportedModelFormats:
    - autoSelect: true
      name: watson-nlp

Create the ServingRuntime resource:

kubectl apply -f watson-embed-demos/nlp/modelmesh-serving/servingruntime.yaml --namespace modelmesh-serving

Now you see the new Watson NLP serving runtime, in addition to those provided by default:

kubectl get servingruntimes

NAME                 DISABLED   MODELTYPE     CONTAINERS           AGE
mlserver-0.x                    sklearn       mlserver             7m6s
ovms-1.x                        openvino_ir   ovms                 7m6s
triton-2.x                      keras         triton               7m6s
watson-nlp-runtime              watson-nlp    watson-nlp-runtime   7s
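
Note that ModelMesh scales a runtime's deployment to zero while no model uses it, so don't expect a watson-nlp-runtime pod until an InferenceService references the runtime. You can observe this with:

kubectl get deployments -n modelmesh-serving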

Upload a pre-trained Watson NLP model to Cloud Object Storage

The next step is to upload a model to object storage. Watson NLP provides pre-trained models as containers, which are usually run as init containers that copy their data to a volume shared with the watson-nlp-runtime (see Deployments to Kubernetes using yaml files or helm charts). When using ModelMesh, the goal is instead to copy the model data to COS. To achieve this, we can run the model container as a Kubernetes Job, configured to write to COS rather than to a local volume mount.

An example Job is provided which launches the model container for the Syntax model. Its environment variables configure the container to copy its data to COS, referencing the credentials from the localMinIO section of the storage-config secret, which is mounted as a volume.

apiVersion: batch/v1
kind: Job
metadata:
  name: model-upload
  namespace: modelmesh-serving
spec:
  template:
    spec:
      containers:
        - name: syntax-izumo-en-stock
          image: cp.icr.io/cp/ai/watson-nlp_syntax_izumo_lang_en_stock:1.0.7
          env:
            - name: UPLOAD
              value: "true"
            - name: ACCEPT_LICENSE
              value: "true"
            - name: S3_CONFIG_FILE
              value: /storage-config/localMinIO
            - name: UPLOAD_PATH
              value: models
          volumeMounts:
            - mountPath: /storage-config
              name: storage-config
              readOnly: true
      volumes:
        - name: storage-config
          secret:
            defaultMode: 420
            secretName: storage-config
      restartPolicy: Never
  backoffLimit: 2

Create the Job:

kubectl apply -f watson-embed-demos/nlp/modelmesh-serving/job.yaml --namespace modelmesh-serving
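
Before moving on, it is worth confirming that the upload completed; the Job should report a successful completion, and its logs should show the model files being copied to COS:

kubectl get job model-upload -n modelmesh-serving
kubectl logs job/model-upload -n modelmesh-serving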

Create an InferenceService for the Syntax model

Finally, an InferenceService CR needs to be created to make the model available via the serving runtime we created earlier. This resource defines the location of the syntax-izumo-en model in COS. It also specifies a modelFormat of watson-nlp, which associates the model with the watson-nlp-runtime serving runtime.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: syntax-izumo-en
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: watson-nlp
      storage:
        path: models/syntax_izumo_lang_en_stock
        key: localMinIO

Create the InferenceService:

kubectl apply -f watson-embed-demos/nlp/modelmesh-serving/inferenceservice.yaml

The status of the InferenceService can be verified:

kubectl get InferenceService

NAME              URL                                               READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
syntax-izumo-en   grpc://modelmesh-serving.modelmesh-serving:8033   True   

Note that the watson-nlp-runtime container image can take 5-10 minutes to download. Until the pull has completed, the InferenceService will show a READY status of False.
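
You can watch the InferenceService until READY flips to True:

kubectl get inferenceservice syntax-izumo-en -w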

Test the model

The modelmesh-serving Service does not expose a REST port, only gRPC. Interacting with gRPC requires the proto files. They are published here. Enter the following commands to test the Syntax model using grpcurl:

kubectl port-forward service/modelmesh-serving 8033:8033

Open a second terminal and run the following commands:

git clone https://github.com/IBM/ibm-watson-embed-clients
cd ibm-watson-embed-clients/watson_nlp/protos
grpcurl -plaintext -proto ./common-service.proto \
-H 'mm-vmodel-id: syntax-izumo-en' \
-d '
{
  "parsers": [
    "TOKEN"
  ],
  "rawDocument": {
    "text": "This is a test."
  }
}
' \
127.0.0.1:8033 watson.runtime.nlp.v1.NlpService.SyntaxPredict
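
As a sanity check, grpcurl can also list the services defined by the proto files; because -proto supplies the schema, no server connection or reflection is needed:

grpcurl -proto ./common-service.proto list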

The gRPC call is routed by the modelmesh-serving Service to the appropriate serving runtime pod for the requested model. ModelMesh ensures there are enough serving runtime pods to meet demand. The response from the watson-nlp-runtime should look like this:

{
  "text": "This is a test.",
  "producerId": {
    "name": "Izumo Text Processing",
    "version": "0.0.1"
  },
  "tokens": [
    {
      "span": {
        "end": 4,
        "text": "This"
      }
    },
    {
      "span": {
        "begin": 5,
        "end": 7,
        "text": "is"
      }
    },
    {
      "span": {
        "begin": 8,
        "end": 9,
        "text": "a"
      }
    },
    {
      "span": {
        "begin": 10,
        "end": 14,
        "text": "test"
      }
    },
    {
      "span": {
        "begin": 14,
        "end": 15,
        "text": "."
      }
    }
  ],
  "sentences": [
    {
      "span": {
        "end": 15,
        "text": "This is a test."
      }
    }
  ],
  "paragraphs": [
    {
      "span": {
        "end": 15,
        "text": "This is a test."
      }
    }
  ]
}