FAILED Full Pipeline Run

Hello,
The setup on prem with CentOS7, requires proxy.
All under control and working, however, when I started test Pipeline run I got error:

Full training of EnglishTxtClassifier 2.0 launched - Run 6046ca85-e2f5-48ca-a151-76ee48b10cac
Full training of EnglishTxtClassifier 2.0 scheduled - Run 6046ca85-e2f5-48ca-a151-76ee48b10cac
Full training of EnglishTxtClassifier 2.0 failed - Run 6046ca85-e2f5-48ca-a151-76ee48b10cac
Error Details : Download failed due to - Request error: POST unix://localhost:80/images/create?fromImage=registry.replicated.com%2Fai-fabric%2Fenglishtextclassification&tag=2-gpu-train: 500, body: {"message":"Get https://registry.replicated.com/v2/: proxyconnect tcp: dial tcp: lookup http on 10.96.0.10:53: no such host"}

Nothing new. Obvious approach to connect not knowing proxy, confirmed by looking at kubectl logs -n aifabric ai-trainer-deployment-7fcbc6f97-ps66b
Million dollar question: Which pod or deployment tries to connect, so I can tell to the pod or deployment what are the Proxy parameters.
Knowing that, I will update my regular maintenance script.

Is the proxy correctly setup in ai-trainer-deployment?

Hello Jeremy!
Double checked, and to be sure, I did rollout restart deployment as you taught me :slight_smile:

Answer - yes, parameters are setup correctly. Even during redeployment I have seen that image was pulled successfully from replicated.com.

But still, Pipeline is with error.

Robert

Weird this pod is the one making the call, so we need to make sure that proxy is well defined and same for no proxy (to local rook-cph IP and docker registry IP).
Also to make sure that it works could you sh into this pod and try curl (or maybe ping) replicated or any website to validate that proxy is working there?
kubectl -n ai-fabric exec -it ai-trainer-deployment-<full-name> -- sh
and then curl -v command if available.
Also what are logs that you saw on the pod directly?

Agree, pod is behaving in strange way…
kubectl describe -n aifabric pod ai-trainer-deployment-854445b899-29mzz

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  7m1s                   default-scheduler  Successfully assigned aifabric/ai-trainer-deployment-854445b899-29mzz to ai-center
  Normal   Pulling    6m59s                  kubelet            Pulling image "registry.replicated.com/ai-fabric/ai-trainer:v21.3.2"
  Normal   Pulled     6m56s                  kubelet            Successfully pulled image "registry.replicated.com/ai-fabric/ai-trainer:v21.3.2"
  Normal   Created    6m55s                  kubelet            Created container ai-trainer-deployment
  Normal   Started    6m54s                  kubelet            Started container ai-trainer-deployment
  Warning  Unhealthy  5m21s (x3 over 6m21s)  kubelet            Readiness probe failed: Get http://10.32.0.33:8086/ai-trainer/actuator/echo: dial tcp 10.32.0.33:8086: connect: connection refused

investigating connection refused piece (will enhance NO_Proxy with IP addresses soon):

[root@ai-center Crons]# kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz -- curl -v http://10.32.0.33:8086/ai-trainer/actuator/echo
* Uses proxy env variable NO_PROXY == '10.32.0.0/22,10.96.0.0/22,127.0.0.1,aif-core/unstable,.default,kotsadm-api-node,kubernetes,.kurl,.local,localhost,.monitoring,.rook-ceph,.svc,192.168.1.*,10.32.0.0/16,10.96.0.0/16,172.28.0.0/16,10.96.3.125,10.32.0.33,10.32.0.50'
*   Trying 10.32.0.33:8086...
* Connected to 10.32.0.33 (10.32.0.33) port 8086 (#0)
> GET /ai-trainer/actuator/echo HTTP/1.1
> Host: 10.32.0.33:8086
> User-Agent: curl/7.69.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: GET, POST, DELETE, PUT, OPTIONS, PATCH
< Access-Control-Allow-Headers: *
< Access-Control-Allow-Credentials: false
< Access-Control-Max-Age: 600
< Vary: Origin
< Vary: Access-Control-Request-Method
< Vary: Access-Control-Request-Headers
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 1; mode=block
< Cache-Control: no-cache, no-store, max-age=0, must-revalidate
< Pragma: no-cache
< Expires: 0
< X-Frame-Options: DENY
< Content-Type: application/json
< Transfer-Encoding: chunked
< Date: Tue, 27 Jul 2021 14:56:24 GMT
<
* Connection #0 to host 10.32.0.33 left intact
**{"respCode":200,"respMsg":"OK"}**

Bingo!
Question, but why?? everywhere else is all good. I think Razvan mentioned that proxy authorization is not supported, maybe this is what he had in mind.

[root@ai-center Crons]# kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz -- curl -v https://registry.replicated.com/v2/
* Uses proxy env variable NO_PROXY == '10.32.0.0/22,10.96.0.0/22,127.0.0.1,aif-core/unstable,.default,kotsadm-api-node,kubernetes,.kurl,.local,localhost,.monitoring,.rook-ceph,.svc,192.168.1.*,10.32.0.0/16,10.96.0.0/16,172.28.0.0/16,10.96.3.125,10.32.0.33,10.32.0.50'
* Uses proxy env variable HTTPS_PROXY == 'http://login:pass@192.109.190.88:8080'
* Unsupported proxy syntax in 'http://login:pass@192.109.190.88:8080'
* Closing connection -1
curl: (5) Unsupported proxy syntax in 'http://login:pass@192.109.190.88:8080'
command terminated with exit code 5

one more finding.
This structure run from inside pod works: curl -x http://192.109.190.88:8080 --proxy-user login:pass -L http://mbank.pl

What if you curl directly a public address like https://www.google.com, do you still see the issue with unsupported syntax? Also with http://www.google.com?

Depends.

  1. From Host: curl -v www.google.com - works fine
  2. From Nod HTTPS connection: kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz – curl -v https://registry.replicated.com/v2/ results in curl: (5) Unsupported proxy syntax
  3. From Nod HTTP: kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz – curl -v http://www.google.com results in curl: (6) Could not resolve host: www.google.com
  4. From Nod HTTP with proxy in curl: kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz – curl -v -x http://192.109.190.88:8080 --proxy-user login:pass http://www.google.com → ALL GOOD
  5. From Nod HTTPS: kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz – curl -v -x http://192.109.190.88:8080 --proxy-user login:pass https://mbank.pl → ALL GOOD

Then from within the NOD:

[root@ai-center ~]# kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz -- sh
/app # env | grep -i proxy
AI_JWKS_PROXY_PORT_80_TCP_PROTO=tcp
HTTPS_PROXY=http://log:pass@192.109.190.88:8080
NO_PROXY=10.32.0.0/22,10.96.0.0/22,127.0.0.1,aif-core/unstable,.default,kotsadm-api-node,kubernetes,.kurl,.local,localhost,.monitoring,.rook-ceph,.svc,192.168.1.*,10.32.0.0/16,10.96.0.0/16,172.28.0.0/16,10.96.3.125,10.32.0.33,10.32.0.50
AI_JWKS_PROXY_PORT_80_TCP=tcp://10.96.1.175:80
AI_JWKS_PROXY_SERVICE_HOST=10.96.1.175
AI_JWKS_PROXY_SERVICE_PORT=80
AI_JWKS_PROXY_PORT=tcp://10.96.1.175:80
HTTP_PROXY=http://log:pass@192.109.190.88:8080
AI_JWKS_PROXY_PORT_80_TCP_ADDR=10.96.1.175
AI_JWKS_PROXY_PORT_80_TCP_PORT=80

I see that there is a AI_JWKS_PROXY … maybe they are somehow conflicting and I try to run proxy through proxy?

Idea to investigate - password contains # character, → encoding into %23 ?

That was the correct solution.

  1. kubectl set env -n aifabric deployment/ai-trainer-deployment HTTP_PROXY= → here the # gets encoded with %23
  2. kubectl -n aifabric rollout restart deployment ai-trainer-deployment
  3. After pipeline restart, I see running download processes:
2021-07-28 07:26:30 [pool-15-thread-1] INFO  c.s.docker.client.LoggingPullHandler.progress - pull registry.replicated.com/ai-fabric/englishtextclassification:2-gpu-train: ProgressMessage{id=aa250b89dc96, status=Downloading, stream=null, error=null, progress=[============>                                      ]  64.87MB/255.8MB, progressDetail=ProgressDetail{current=64866793, start=null, total=255811220}}

1 Like

woo good job Robert!
Do you know why this wasn’t an issue for ai-deployer-deployment? Did you change the password in between or did you use a different method to modify it?

For the ai-deployer-deployment we used different password and user.
I followed this info: URL Encoding of Special Characters
First table was obvious, second “possibility of being misunderstood within URLs” - not so much. The problem did not ring a bell earlier since the used password is actually working w/o encoding. Looks like there is some specific precondition that increases the “possibility of being misunderstood within URLs” .

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.