Sporadic connection reset while checking out svn externals

Setup is rcstack 5.1.1 CE. Migrated svn repos from collabnet subversion. In a low amount of cases (<5%), checking out an external fails with 502 Bad Gateway when connecting over Traefik or just connection reset when we tried direct connection over port 10020.

The error seems random. When we try again the checkout it most likely will succeed. The affected repo or external are also random. It can appear with different clients, both svnkit and Tortoise SVN.

Unfortunatly, there is no reliable way to reproduce the error.

I tried adding way more system resources. Ram or CPU don’t seem to be stressed at all.

Example error message on teamcity client with svnkit:

Failed to perform checkout on agent: Subversion: checkout failed: Subversion update_external problem for C:\BuildAgent\work\ef4b3fe54d807ab\folder1\externals\folder2\externals\bx: svn: E175002: Connection reset

Example error with Tortoise SVN:

svn: E1750002: Unexpexted HTTP status 502 'Bad Gateway' on '/repo/trunk/folder'

Logs on the server are not very helpful.
docker logs on rhodecode-1, vcsserver-1 or svn-1 only show successful connections with status codes 200, 207 and so on. But they are very verbose and I am not sure what I am looking for.

Any help is appreciated.

1 Like

Hi @kgu,

Thanks for reporting this. The team is checking the possible roots of the issue and will reach back to you in this topic to debug and solve it.

Thanks, Artem

the 502 from traefik would indicate either svn apache server, or gunicorn workers times out handling the requests.

can you check what are you values in config files named

gunicorn_conf_vcs.py
gunicorn_conf_rc.py

timeout =
graceful_timeout =

can you also check in svn-1 logs if there aren’t any errors for timeouts etc

1 Like

Hi, thanks for trying to help.

the values in both files are:

timeout = 21600
graceful_timeout = 21600

svn-1 logs show no entries at all for the time of occurance after greping out status codes 2xx and RhodeCode-Healthcheck.

The error occurs relatively fast in between succesful checkouts. Here are some logs from the client side with timestamps:

[2024-08-10 05:26:32,991]   INFO -      jetbrains.buildServer.VCS - Performing SVNKit agent-side checkout from http://svn.example.com:10020/... 17:58:30 +0200 to C:\BuildAgent\work\ef4b3fe54d807ab\project for build SLM Display SDK Version 4::SLM Display SDK Octave (Lab) Windows #142 {buildId=38180}
[2024-08-10 05:26:41,449]   INFO -      jetbrains.buildServer.VCS - Load revision for http://svn.example.com:10020/SLM_GI_Modules/trunk/project: 7211
[2024-08-10 05:26:56,395]   INFO -      jetbrains.buildServer.VCS - Load revision for http://svn.example.com:10020/Libs/trunk/fftw: 3712
[2024-08-10 05:27:07,121]   INFO -  jetbrains.buildServer.VCS.SVN - Checkout failure: Subversion: checkout failed: Subversion update_external problem for C:\BuildAgent\work\ef4b3fe54d807ab\project\externals\examplelibLibraryUIwx\externals\examplelibLibraryAutoConfig\externals\adl_sdk: svn: E175002: Connection reset
jetbrains.buildServer.vcs.VcsException: Subversion: checkout failed: Subversion update_external problem for C:\BuildAgent\work\ef4b3fe54d807ab\project\externals\examplelibLibraryUIwx\externals\examplelibLibraryAutoConfig\externals\adl_sdk: svn: E175002: Connection reset
	at jetbrains.buildServer.vcs.svn.SvnIncludeRuleUpdater$SvnSourceUpdater.checkoutSources(SvnIncludeRuleUpdater.java:176)
	at jetbrains.buildServer.vcs.svn.SvnIncludeRuleUpdater.process(SvnIncludeRuleUpdater.java:77)
	at jetbrains.buildServer.agent.impl.vcs.CheckoutByIncludeRuleAdapter$4.updateSources(CheckoutByIncludeRuleAdapter.java:103)
	at jetbrains.buildServer.agent.impl.vcs.AgentVcsManagerExImpl$CheckoutSupportImpl.updateSources(AgentVcsManagerExImpl.java:113)
	at jetbrains.buildServer.agent.impl.patch.ProjectSourcesOnAgent$1.run(ProjectSourcesOnAgent.java:165)
	at java.lang.Thread.run(Thread.java:748)
[2024-08-10 05:27:16,194]   INFO -      jetbrains.buildServer.VCS - Set default timeout for Perforce operations to 600 seconds
[2024-08-10 05:27:37,490]   INFO -      jetbrains.buildServer.VCS - Performing SVNKit agent-side checkout from http://svn.example.com:10020/SomeFolder/trunk@1854_2024/08/09 10:08:29 +0200 to C:\BuildAgent\work\1b976139c178da22 for build Somebuild Win64 #730 {buildId=38184}
[2024-08-10 05:27:42,470]   INFO -      jetbrains.buildServer.VCS - Load revision for http://svn.example.com:10020/SomeFolder/trunk: 1854
[2024-08-10 05:27:52,534]   INFO -      jetbrains.buildServer.VCS - Load revision for https://svn.example.com/Libs/trunk/fftw: 3711
[2024-08-10 05:32:41,604]   INFO -      jetbrains.buildServer.VCS - Set default timeout for Perforce operations to 600 seconds
[2024-08-10 05:32:57,049]   INFO -      jetbrains.buildServer.VCS - Performing SVNKit agent-side checkout from http://svn.example.com:10020/SomeFolder/trunk@1854_2024/08/09 10:08:29 +0200 to C:\BuildAgent\work\1b976139c178da22 for build Somebuild (Lab) Win64 #695 {buildId=38183}
[2024-08-10 05:33:01,429]   INFO -      jetbrains.buildServer.VCS - Load revision for http://svn.example.com:10020/SomeFolder/trunk: 1854
[2024-08-10 05:33:10,892]   INFO -      jetbrains.buildServer.VCS - Load revision for https://svn.example.com/Libs/trunk/fftw: 3711
[2024-08-10 05:33:42,136]   INFO -  jetbrains.buildServer.VCS.SVN - Checkout failure: Subversion: checkout failed: Subversion update_external problem for C:\BuildAgent\work\1b976139c178da22\Tools\examplePowerUpTray\externals\example_stdlib: svn: E175002: PROPFIND of '/Libs/trunk/example_stdlib': 502 Bad Gateway (https://svn.example.com)
jetbrains.buildServer.vcs.VcsException: Subversion: checkout failed: Subversion update_external problem for C:\BuildAgent\work\1b976139c178da22\Tools\exampleToolexternals\example_stdlib: svn: E175002: PROPFIND of '/Libs/trunk/example_stdlib': 502 Bad Gateway (https://svn.example.com)
	at jetbrains.buildServer.vcs.svn.SvnIncludeRuleUpdater$SvnSourceUpdater.checkoutSources(SvnIncludeRuleUpdater.java:176)
	at jetbrains.buildServer.vcs.svn.SvnIncludeRuleUpdater.process(SvnIncludeRuleUpdater.java:77)
	at jetbrains.buildServer.agent.impl.vcs.CheckoutByIncludeRuleAdapter$4.updateSources(CheckoutByIncludeRuleAdapter.java:103)
	at jetbrains.buildServer.agent.impl.vcs.AgentVcsManagerExImpl$CheckoutSupportImpl.updateSources(AgentVcsManagerExImpl.java:113)
	at jetbrains.buildServer.agent.impl.patch.ProjectSourcesOnAgent$1.run(ProjectSourcesOnAgent.java:165)
	at java.lang.Thread.run(Thread.java:748)
[2024-08-10 05:33:52,696]   INFO -      jetbrains.buildServer.VCS - Set default timeout for Perforce operations to 600 seconds

Also seen here, that it makes no difference if we connect over http://svn.example.com:10020 directly to rhodecode or to https://svn.example.com:443 over Traefik.

can you replace the gunicorn config attached with the one under config/_shared/gunicorn_conf_rc.py directory ?

It enables some more logging in docker, so maybe we can catch 50X errors from gunicorn itself.

gunicorn_config.py (17.7 KB)

I replaced the config file, re-triggered the error. In the text below are the errors of the svn-client, traefik-1, svn-1 and rhodecode-1. domain name is replaced.

502 in traefik relates to a 207 in svn-1 relates to a 207 in rhodecode-1 with “view_name”: “NOT_FOUND”. Not sure how relevant that view_name is. That happens a lot in the logs.

Maybe relevant: I mounted existing repo folder as repo_store and did need to do a chown -R 999:999 .

#teamcity
 Failed to perform checkout on agent: Error during update from https://svn.example.cm/SLM_GI_Modules/trunk/SLMDisplaySDK to C:\BuildAgent\work\e8f914cae9a8dd32\SLMDisplaySDK: svn: E175002: PROPFIND of '/SLM_GI_Modules/!svn/vcc/default': 502 Bad Gateway (https://svn.example.cm


#docker logs rc_cluster_router-traefik-1
192.168.11.37 - - [15/Aug/2024:11:11:37 +0000] "PROPFIND /SLM_GI_Modules/!svn/vcc/default HTTP/1.1" 502 11 "-" "-" 19239 "https-rhodecode@docker" "http://172.19.0.16:10020" 0ms

#docker logs svn-1

{ "time":"2024-08-15T11:11:37.501Z", "process":"401", "filename":"dav_svn:/var/opt/rhodecode_repo_store/SLM_GI_Modules", "remoteIP":"172.19.0.16", "host":"svn.example.cm", "request":"/SLM_GI_Modules/!svn/vcc/default", "query":"", "method":"PROPFIND", "status":"207", "userAgent":"SVN/1.10.0 SVNKit/1.10.10 (http://svnkit.com/) r10865", "referer":"-" }


#docker logs rhodecode-1

{"timestamp": "2024-08-15T11:11:37.502826+00:00", "levelname": "INFO", "name": "rhodecode.lib.middleware.request_wrapper", "message": "Req[2266] <AuthUser('id:0[ANONYMOUS] ip:192.168.11.37 auth:False')> PROPFIND Request to /SLM_GI_Modules/!svn/vcc/default time: 0.0356s [SVN/1.10.0 SVNKit/1.10.10 (http://svnkit.com/) r10865], RhodeCode 5.1.1", "req_id": "158ef050-2e33-48e1-9cac-6a245e621554", "time": 0.035616397857666016, "ver": "5.1.1", "ip": "192.168.11.37", "path": "/SLM_GI_Modules/!svn/vcc/default", "view_name": "NOT_FOUND", "code": 207}
{"timestamp": "2024-08-15T11:11:37.519856+00:00", "levelname": "INFO", "name": "rhodecode.lib.middleware.simplevcs", "message": "Access for IP:192.168.11.37 allowed", "req_id": "3a33e301-6d74-4585-966f-9f4c2500500b"}
{"timestamp": "2024-08-15T11:11:37.524405+00:00", "levelname": "INFO", "name": "rhodecode.authentication.base", "message": "Authenticating user `buildserver` using egg:rhodecode-enterprise-ce#token plugin", "req_id": "3a33e301-6d74-4585-966f-9f4c2500500b"}

# This authfailure here happens a lot. Don't know if this is normal. Root problem is still present with local account instead of LDAP

{"timestamp": "2024-08-15T11:11:37.528013+00:00", "levelname": "WARNING", "name": "rhodecode.authentication.plugins.auth_token", "message": "user `buildserver` failed to authenticate via authtoken, reason: bad or inactive token.", "req_id": "3a33e301-6d74-4585-966f-9f4c2500500b"}
{"timestamp": "2024-08-15T11:11:37.531498+00:00", "levelname": "INFO", "name": "rhodecode.authentication.base", "message": "Authenticating user `buildserver` using egg:rhodecode-enterprise-ce#ldap plugin", "req_id": "3a33e301-6d74-4585-966f-9f4c2500500b"}
{"timestamp": "2024-08-15T11:11:37.532109+00:00", "levelname": "INFO", "name": "rhodecode.lib.middleware.simplevcs", "message": "MAIN-AUTH successful for user `buildserver` from ldap plugin", "req_id": "3a33e301-6d74-4585-966f-9f4c2500500b"}

So one theory we have is that SVN PROPFIND uses a lot of concurrent connections, thus http 207 status code. But maybe there aren’t enough workers to handle the connection thus traefik reports 502.

Could you try to scale up svn/ce/vcs server number of pods to higher numbers ?

1 Like

I raised the number of replicas in docker-compose-apps.override.yaml to 6 each for rhodecode, vcsserver and svn.

Unfortunately the error happened again when doing the big checkout.

#traefik
192.168.11.37 - - [16/Aug/2024:12:20:07 +0000] "PROPFIND /Libs/!svn/bc/3483/trunk/example_jondoe HTTP/1.1" 502 11 "-" "-" 43798 "https-rhodecode@docker" "http://172.19.0.16:10020" 2ms
#svn
svn-2          | { "time":"2024-08-16T12:20:07.863Z", "process":"797", "filename":"dav_svn:/var/opt/rhodecode_repo_store/Libs/trunk/example_jondoe", "remoteIP":"172.19.0.22", "host":"svn.example.com", "request":"/Libs/!svn/bc/3483/trunk/example_jondoe", "query":"", "method":"PROPFIND", "status":"207", "userAgent":"SVN/1.10.0 SVNKit/1.10.10 (http://svnkit.com/) r10865", "referer":"-" }

I also moved the whole ubuntu vm with the stack to a more powerful server, but no change. apache httpd also spikes to 100% while checking out.

Since the problem is only relevant when doing really big checkouts from a the buildserver at night: Might it be possible to connect directly to apache? Maybe with basic auth if the credentials are handled on the rhodecode layer. Since it’s only one server and not users, who can still use rhodecode, that might be acceptable.

We wouldn’t recommend doing direct apache connection since it would bypass a lot of internals of rhodecode, and you lose track of pull logic for repositories.

Do you have any specific time that the 502 error happens ? After e.g 5 minutes etc ?

also something worth mentioning, do you use SSL ?
Does generated apache config has this line in it ?

RequestHeader edit Destination ^https http early

1 Like

There seems to be no pattern to the error. To test if a configuration change we did earlier improves the reliability, I run a build-all-software-script and force a delete of all local source code. That will download a lot of data from svn, with lots of externals. This test might even succeed without error sometimes, fail the 2nd try, fail with a different repo/external the third try, succeed the fourth try. It’s like having low percentage failure chance and we need to lots of checkout to see it. That’s why it’s mainly a problem with this build-all-all-software job.

We use ssl with a let’s encrypt cert generated by traefik. But one of the first things I tried was connection directly to rhodecode-1 over Port 10020, removing Traefik from the equation. In a previous post you can see the error occuring both with https over traefik and with http without traefik.

Yes, the line
RequestHeader edit Destination ^https: http: early
is present in /etc/rhodecode/conf/svn/mod_dav_svn.conf

Trefik is unlikely the cause.

The way how SVN protocol works is following

svn-client → traefik → rhodecode-web (gunicorn) → mod_dav_apache

The 502 reporter is either from gunicorn:

  • request timeout (but we see 21600 seconds there so unlikely)
  • worker kill, maybe out-of-memory ?

Or is either from mod_dav_apache

  • request too big
  • timeout

You COULD bypass rhodecode-web and connect directly to apache by modifying the configuration of mod_dav_svn and remote authentication, then checking a clone directly into the mod-dav-apache, however this would probably be only working via localhost so probably much faster and potentially harder to reproduce.

given the random nature of the problem, we maybe thinks it’s memory issue.
During the test can you monitor docker instances using docker stats, and see if any of those don’t go close to memory limit ?

1 Like