FAILED Full Pipeline Run

Robert_Isajuk · July 27, 2021, 1:41pm

Hello,
The setup is on-prem on CentOS 7 and requires a proxy.
Everything is under control and working; however, when I started a test pipeline run I got this error:

Full training of EnglishTxtClassifier 2.0 launched - Run 6046ca85-e2f5-48ca-a151-76ee48b10cac
Full training of EnglishTxtClassifier 2.0 scheduled - Run 6046ca85-e2f5-48ca-a151-76ee48b10cac
Full training of EnglishTxtClassifier 2.0 failed - Run 6046ca85-e2f5-48ca-a151-76ee48b10cac
Error Details : Download failed due to - Request error: POST unix://localhost:80/images/create?fromImage=registry.replicated.com%2Fai-fabric%2Fenglishtextclassification&tag=2-gpu-train: 500, body: {"message":"Get https://registry.replicated.com/v2/: proxyconnect tcp: dial tcp: lookup http on 10.96.0.10:53: no such host"}

Nothing new: something is obviously trying to connect without knowing the proxy, as confirmed by kubectl logs -n aifabric ai-trainer-deployment-7fcbc6f97-ps66b
Million-dollar question: which pod or deployment is trying to connect, so that I can give it the proxy parameters?
Once I know that, I will update my regular maintenance script.

Jeremy_Tederry · July 27, 2021, 1:55pm

Is the proxy correctly set up in ai-trainer-deployment?

Robert_Isajuk · July 27, 2021, 2:36pm

Hello Jeremy!
Double-checked, and to be sure, I did a rollout restart of the deployment as you taught me.

Answer: yes, the parameters are set up correctly. During the redeployment I even saw the image being pulled successfully from replicated.com.

But the pipeline still fails with the error.

Robert

Jeremy_Tederry · July 27, 2021, 2:51pm

Weird. This pod is the one making the call, so we need to make sure that the proxy is well defined there, and the same for no-proxy (covering the local rook-ceph IP and the docker registry IP).
Also, to make sure that it works, could you sh into this pod and try to curl (or maybe ping) replicated or any website, to validate that the proxy is working there?
kubectl -n aifabric exec -it ai-trainer-deployment-<full-name> -- sh
and then run a curl -v command, if curl is available.
Also, what logs did you see on the pod directly?

Robert_Isajuk · July 27, 2021, 2:57pm

Agreed, the pod is behaving in a strange way…
kubectl describe -n aifabric pod ai-trainer-deployment-854445b899-29mzz

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  7m1s                   default-scheduler  Successfully assigned aifabric/ai-trainer-deployment-854445b899-29mzz to ai-center
  Normal   Pulling    6m59s                  kubelet            Pulling image "registry.replicated.com/ai-fabric/ai-trainer:v21.3.2"
  Normal   Pulled     6m56s                  kubelet            Successfully pulled image "registry.replicated.com/ai-fabric/ai-trainer:v21.3.2"
  Normal   Created    6m55s                  kubelet            Created container ai-trainer-deployment
  Normal   Started    6m54s                  kubelet            Started container ai-trainer-deployment
  Warning  Unhealthy  5m21s (x3 over 6m21s)  kubelet            Readiness probe failed: Get http://10.32.0.33:8086/ai-trainer/actuator/echo: dial tcp 10.32.0.33:8086: connect: connection refused

Investigating the "connection refused" piece (I will extend NO_PROXY with IP addresses soon):

[root@ai-center Crons]# kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz -- curl -v http://10.32.0.33:8086/ai-trainer/actuator/echo
* Uses proxy env variable NO_PROXY == '10.32.0.0/22,10.96.0.0/22,127.0.0.1,aif-core/unstable,.default,kotsadm-api-node,kubernetes,.kurl,.local,localhost,.monitoring,.rook-ceph,.svc,192.168.1.*,10.32.0.0/16,10.96.0.0/16,172.28.0.0/16,10.96.3.125,10.32.0.33,10.32.0.50'
*   Trying 10.32.0.33:8086...
* Connected to 10.32.0.33 (10.32.0.33) port 8086 (#0)
> GET /ai-trainer/actuator/echo HTTP/1.1
> Host: 10.32.0.33:8086
> User-Agent: curl/7.69.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: GET, POST, DELETE, PUT, OPTIONS, PATCH
< Access-Control-Allow-Headers: *
< Access-Control-Allow-Credentials: false
< Access-Control-Max-Age: 600
< Vary: Origin
< Vary: Access-Control-Request-Method
< Vary: Access-Control-Request-Headers
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 1; mode=block
< Cache-Control: no-cache, no-store, max-age=0, must-revalidate
< Pragma: no-cache
< Expires: 0
< X-Frame-Options: DENY
< Content-Type: application/json
< Transfer-Encoding: chunked
< Date: Tue, 27 Jul 2021 14:56:24 GMT
<
* Connection #0 to host 10.32.0.33 left intact
{"respCode":200,"respMsg":"OK"}

Robert_Isajuk · July 27, 2021, 3:09pm

Bingo!
The question is: why? Everywhere else all is good. I think Razvan mentioned that proxy authorization is not supported; maybe this is what he had in mind.

[root@ai-center Crons]# kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz -- curl -v https://registry.replicated.com/v2/
* Uses proxy env variable NO_PROXY == '10.32.0.0/22,10.96.0.0/22,127.0.0.1,aif-core/unstable,.default,kotsadm-api-node,kubernetes,.kurl,.local,localhost,.monitoring,.rook-ceph,.svc,192.168.1.*,10.32.0.0/16,10.96.0.0/16,172.28.0.0/16,10.96.3.125,10.32.0.33,10.32.0.50'
* Uses proxy env variable HTTPS_PROXY == 'http://login:pass@192.109.190.88:8080'
* Unsupported proxy syntax in 'http://login:pass@192.109.190.88:8080'
* Closing connection -1
curl: (5) Unsupported proxy syntax in 'http://login:pass@192.109.190.88:8080'
command terminated with exit code 5

Robert_Isajuk · July 27, 2021, 3:19pm

One more finding.
This form, run from inside the pod, works: curl -x http://192.109.190.88:8080 --proxy-user login:pass -L http://mbank.pl

Jeremy_Tederry · July 27, 2021, 4:00pm

What if you curl a public address such as https://www.google.com directly: do you still see the unsupported-syntax issue? And what about http://www.google.com?

Robert_Isajuk · July 27, 2021, 6:48pm

Depends.

From host: curl -v www.google.com - works fine
From pod, HTTPS: kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz -- curl -v https://registry.replicated.com/v2/ results in curl: (5) Unsupported proxy syntax
From pod, HTTP: kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz -- curl -v http://www.google.com results in curl: (6) Could not resolve host: www.google.com
From pod, HTTP with the proxy given to curl: kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz -- curl -v -x http://192.109.190.88:8080 --proxy-user login:pass http://www.google.com → ALL GOOD
From pod, HTTPS with the proxy given to curl: kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz -- curl -v -x http://192.109.190.88:8080 --proxy-user login:pass https://mbank.pl → ALL GOOD

Then from within the pod:

[root@ai-center ~]# kubectl exec -ti -n aifabric ai-trainer-deployment-854445b899-29mzz -- sh
/app # env | grep -i proxy
AI_JWKS_PROXY_PORT_80_TCP_PROTO=tcp
HTTPS_PROXY=http://log:pass@192.109.190.88:8080
NO_PROXY=10.32.0.0/22,10.96.0.0/22,127.0.0.1,aif-core/unstable,.default,kotsadm-api-node,kubernetes,.kurl,.local,localhost,.monitoring,.rook-ceph,.svc,192.168.1.*,10.32.0.0/16,10.96.0.0/16,172.28.0.0/16,10.96.3.125,10.32.0.33,10.32.0.50
AI_JWKS_PROXY_PORT_80_TCP=tcp://10.96.1.175:80
AI_JWKS_PROXY_SERVICE_HOST=10.96.1.175
AI_JWKS_PROXY_SERVICE_PORT=80
AI_JWKS_PROXY_PORT=tcp://10.96.1.175:80
HTTP_PROXY=http://log:pass@192.109.190.88:8080
AI_JWKS_PROXY_PORT_80_TCP_ADDR=10.96.1.175
AI_JWKS_PROXY_PORT_80_TCP_PORT=80

I see that there is an AI_JWKS_PROXY … maybe they are somehow conflicting and I am trying to run a proxy through a proxy?
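
On the AI_JWKS_PROXY_* names: variables of the form <NAME>_SERVICE_HOST and <NAME>_PORT_80_TCP are the kind of service-discovery variables Kubernetes injects into pods for a Service, so they are most likely unrelated to the HTTP proxy settings. A minimal Python sketch, assuming the conventional proxy variable names are the only ones that matter to HTTP clients; the helper name split_proxy_env and the sample values are made up for illustration:

```python
# Conventional proxy variable names, compared case-insensitively (assumption).
PROXY_VARS = {"http_proxy", "https_proxy", "no_proxy", "all_proxy"}


def split_proxy_env(env):
    """Separate real proxy settings from variables that merely contain 'PROXY'."""
    real, lookalikes = {}, {}
    for name, value in env.items():
        if name.lower() in PROXY_VARS:
            real[name] = value
        elif "proxy" in name.lower():
            lookalikes[name] = value
    return real, lookalikes


# Sample modelled on the `env | grep -i proxy` output above (shortened).
sample = {
    "HTTPS_PROXY": "http://log:pass@192.109.190.88:8080",
    "HTTP_PROXY": "http://log:pass@192.109.190.88:8080",
    "NO_PROXY": "127.0.0.1,localhost,.svc",
    "AI_JWKS_PROXY_SERVICE_HOST": "10.96.1.175",
    "AI_JWKS_PROXY_SERVICE_PORT": "80",
}

real, lookalikes = split_proxy_env(sample)
print("proxy settings:", sorted(real))
print("look-alikes:   ", sorted(lookalikes))
```

To check a live pod, the same function could be applied to os.environ from a Python shell inside the container, if Python is available there.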

Robert_Isajuk · July 28, 2021, 5:14am

Idea to investigate: the password contains a # character → encode it as %23?

Robert_Isajuk · July 28, 2021, 7:28am

That was the correct solution.

kubectl set env -n aifabric deployment/ai-trainer-deployment HTTP_PROXY= → in the value set here, the # gets encoded as %23
kubectl -n aifabric rollout restart deployment ai-trainer-deployment
After restarting the pipeline, I see downloads in progress:

2021-07-28 07:26:30 [pool-15-thread-1] INFO  c.s.docker.client.LoggingPullHandler.progress - pull registry.replicated.com/ai-fabric/englishtextclassification:2-gpu-train: ProgressMessage{id=aa250b89dc96, status=Downloading, stream=null, error=null, progress=[============>                                      ]  64.87MB/255.8MB, progressDetail=ProgressDetail{current=64866793, start=null, total=255811220}}
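
For anyone who needs to produce the encoded value, the proxy URL can be built with the credentials percent-encoded. A minimal Python sketch; the function name build_proxy_url and the password "pa#ss" are made up for illustration, and only the proxy address is taken from the thread:

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Build a proxy URL with percent-encoded credentials."""
    # safe="" makes quote() encode every reserved character,
    # including '#', '@', ':' and '/', which would otherwise
    # change how the URL is split into its parts.
    user_enc = quote(user, safe="")
    password_enc = quote(password, safe="")
    return f"{scheme}://{user_enc}:{password_enc}@{host}:{port}"


# Hypothetical credentials; the real ones are redacted in the thread.
print(build_proxy_url("login", "pa#ss", "192.109.190.88", 8080))
```

The printed string is what would go into HTTP_PROXY (and presumably HTTPS_PROXY as well) in place of the unencoded URL.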

Jeremy_Tederry · July 28, 2021, 7:31am

woo good job Robert!
Do you know why this wasn’t an issue for ai-deployer-deployment? Did you change the password in between or did you use a different method to modify it?

Robert_Isajuk · July 28, 2021, 7:38am

For the ai-deployer-deployment we used a different user and password.
I followed this info: URL Encoding of Special Characters
The first table was obvious; the second, “possibility of being misunderstood within URLs”, not so much. The problem did not ring a bell earlier, since the password in question actually works without encoding elsewhere. It looks like there is some specific precondition that increases the “possibility of being misunderstood within URLs”.
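
One plausible reason a # is "misunderstood": in a URL it starts the fragment, so a parser may cut the authority part off at that point. The sketch below shows this with Python's urllib.parse, which is not the parser used by curl or by the Docker client, so it only illustrates the general mechanism. The password "pa#ss" is made up.

```python
from urllib.parse import unquote, urlsplit

# Hypothetical password "pa#ss"; the real one is redacted in the thread.
raw = "http://login:pa#ss@192.109.190.88:8080"
encoded = "http://login:pa%23ss@192.109.190.88:8080"

bad = urlsplit(raw)
# Everything after '#' is treated as the fragment, so the host is lost.
print("raw hostname:", bad.hostname)
print("raw fragment:", bad.fragment)

good = urlsplit(encoded)
# With %23 the credentials and the proxy address are parsed as intended.
print("encoded hostname:", good.hostname)
print("encoded port:", good.port)
print("decoded password:", unquote(good.password))
```

Whether a given client tolerates the unencoded form depends on its own parser, which may be why the same password worked elsewhere without encoding.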

system · July 31, 2021, 7:38am

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
