OkHttp SocketTimeout: Problems and Optimizations

Background

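In some scenarios OkHttp hits a SocketTimeout, and a large number of requests then time out for a period of time (up to 5 minutes under the default settings, that is, when no custom connection pool is configured). In our network statistics, SocketTimeout is the top error.

To explain the cause, we need to look at the h2 (HTTP/2) protocol, the connection pool, and OkHttp's implementation.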

OkHttp h2

OkHttp implements the h2 protocol:

  • Connections are reused
  • A single connection can carry multiple streams

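As the figure shows:

  • h2 runs on top of TCP and multiplexes streams above it, so the properties of TCP still apply to h2 (head-of-line blocking, for example)
  • All h2 streams share the same connection, so if a request on that connection loses packets, every request on the connection has to wait

If a timeout occurs on such a connection (in other words, the connection can no longer reach the server) but the network library does not know it, requests for the same host keep being assigned to that connection, and every request added afterwards is bound to fail.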

OkHttp ConnectionPool

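Connection pool

  • Owns one global thread pool with a core size of 0 and an unbounded maximum number of threads
  • Caches connections
  • Decides whether a connection is still usable
  • Cleans up connections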
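
The thread pool in the first bullet can be reproduced with the JDK alone. The sketch below is not OkHttp's code: the class name PoolExecutorSketch, the 60-second keep-alive and the SynchronousQueue are assumptions chosen for illustration. It shows what "core size 0, unbounded maximum" means in practice: no thread is kept while the pool is idle, and a thread is created on demand when a cleanup task is submitted.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolExecutorSketch {
    /** Builds an executor with 0 core threads and an effectively unbounded maximum. */
    static ThreadPoolExecutor newCleanupExecutor() {
        return new ThreadPoolExecutor(
                0,                         // core pool size: no threads kept while idle
                Integer.MAX_VALUE,         // maximum pool size: effectively unbounded
                60L, TimeUnit.SECONDS,     // idle threads exit after 60 s (assumed value)
                new SynchronousQueue<>()); // hands each task directly to a thread
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor executor = newCleanupExecutor();
        // Placeholder for a cleanup pass over the pooled connections.
        executor.execute(() -> System.out.println("cleanup pass (placeholder)"));
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Because the queue has no capacity, each submitted task either reuses an idle thread or starts a new one, which is why the maximum size has to be unbounded for this configuration to accept work.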

Getting a connection

private final Deque<RealConnection> connections = new ArrayDeque<>();

/**
 * Returns a recycled connection to {@code address}, or null if no such connection exists. The
 * route is null if the address has not yet been routed.
 */
@Nullable RealConnection get(Address address, StreamAllocation streamAllocation, Route route) {
  assert (Thread.holdsLock(this));
  for (RealConnection connection : connections) {
    if (connection.isEligible(address, route)) {
      streamAllocation.acquire(connection, true);
      return connection;
    }
  }
  return null;
}

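From the get method we can see that:

  • It iterates over connections to find a RealConnection
  • It calls RealConnection.isEligible to decide whether the connection meets the requirements; if it does, the connection is returned, otherwise get returns null

Let's now look at what isEligible (RealConnection.java) actually does: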

/** Current streams carried by this connection. */
public final List<Reference<StreamAllocation>> allocations = new ArrayList<>();

/**
 * Returns true if this connection can carry a stream allocation to {@code address}. If non-null
 * {@code route} is the resolved route for a connection.
 */
public boolean isEligible(Address address, @Nullable Route route) {
  // If this connection is not accepting new streams, we're done.
  if (allocations.size() >= allocationLimit || noNewStreams) return false;

  // If the non-host fields of the address don't overlap, we're done.
  if (!Internal.instance.equalsNonHost(this.route.address(), address)) return false;

  // If the host exactly matches, we're done: this connection can carry the address.
  if (address.url().host().equals(this.route().address().url().host())) {
    return true; // This connection is a perfect match.
  }

  // At this point we don't have a hostname match. But we still be able to carry the request if
  // our connection coalescing requirements are met. See also:
  // https://hpbn.co/optimizing-application-delivery/#eliminate-domain-sharding
  // https://daniel.haxx.se/blog/2016/08/18/http2-connection-coalescing/

  // 1. This connection must be HTTP/2.
  if (http2Connection == null) return false;

  // 2. The routes must share an IP address. This requires us to have a DNS address for both
  // hosts, which only happens after route planning. We can't coalesce connections that use a
  // proxy, since proxies don't tell us the origin server's IP address.
  if (route == null) return false;
  if (route.proxy().type() != Proxy.Type.DIRECT) return false;
  if (this.route.proxy().type() != Proxy.Type.DIRECT) return false;
  if (!this.route.socketAddress().equals(route.socketAddress())) return false;

  // 3. This connection's server certificate's must cover the new host.
  if (route.address().hostnameVerifier() != OkHostnameVerifier.INSTANCE) return false;
  if (!supportsUrl(address.url())) return false;

  // 4. Certificate pinning must match the host.
  try {
    address.certificatePinner().check(address.url().host(), handshake().peerCertificates());
  } catch (SSLPeerUnverifiedException e) {
    return false;
  }

  return true; // The caller's address can be carried by this connection.
}

The following conditions are checked first:

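  1. For an h2 connection, allocationLimit is Integer.MAX_VALUE (unless the server advertises a lower maximum number of concurrent streams), so allocations.size() >= allocationLimit is normally false for h2 requests. noNewStreams controls whether the connection may accept new request streams: once it is set to true the connection is treated as unusable and is removed by a later connection cleanup.
  2. Check whether the Address of the current connection is consistent with the Address passed in, ignoring the host (equalsNonHost):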

/*Address.java*/

boolean equalsNonHost(Address that) {
  return this.dns.equals(that.dns)
      && this.proxyAuthenticator.equals(that.proxyAuthenticator)
      && this.protocols.equals(that.protocols)
      && this.connectionSpecs.equals(that.connectionSpecs)
      && this.proxySelector.equals(that.proxySelector)
      && equal(this.proxy, that.proxy)
      && equal(this.sslSocketFactory, that.sslSocketFactory)
      && equal(this.hostnameVerifier, that.hostnameVerifier)
      && equal(this.certificatePinner, that.certificatePinner)
      && this.url().port() == that.url().port();
}

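  3. Check whether the requested host is the same as the current connection's host. If it is, and conditions 1 and 2 also hold, this connection is the one to reuse.

The remaining conditions (HTTP/2 connection coalescing) are not analyzed here; interested readers can go through them in the source.

Getting a reusable connection from the pool is not complicated, but two variables deserve attention: allocationLimit and noNewStreams. Together they control whether a connection is usable.

Next, let's see how the pool lookup is used and what the upper layer does with the connection it gets back.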
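
The role of these two variables can be shown with a stripped-down model. This is a sketch, not OkHttp's code: PoolLookupSketch and Conn are hypothetical names, the host comparison stands in for the full Address and Route checks, and the limit of 1 for an HTTP/1.x-style connection is an assumption of the model.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class PoolLookupSketch {
    /** Hypothetical stand-in for RealConnection; only the fields discussed above are modeled. */
    static final class Conn {
        final String host;
        final int allocationLimit; // 1 models HTTP/1.x, Integer.MAX_VALUE models h2
        int allocations;           // streams currently carried
        boolean noNewStreams;      // true once the connection must not take new streams

        Conn(String host, int allocationLimit) {
            this.host = host;
            this.allocationLimit = allocationLimit;
        }

        boolean isEligible(String requestHost) {
            // Same first check as in isEligible above.
            if (allocations >= allocationLimit || noNewStreams) return false;
            return host.equals(requestHost);
        }
    }

    private final Deque<Conn> connections = new ArrayDeque<>();

    void put(Conn c) {
        connections.add(c);
    }

    /** Returns the first eligible pooled connection, or null if there is none. */
    Conn get(String host) {
        for (Conn c : connections) {
            if (c.isEligible(host)) {
                c.allocations++; // the caller now holds one stream on this connection
                return c;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PoolLookupSketch pool = new PoolLookupSketch();
        Conn h2 = new Conn("example.com", Integer.MAX_VALUE);
        pool.put(h2);
        System.out.println(pool.get("example.com") == h2); // first stream
        System.out.println(pool.get("example.com") == h2); // second stream, same connection
        h2.noNewStreams = true;
        System.out.println(pool.get("example.com"));       // no eligible connection left
    }
}
```

In this model an h2 connection keeps being handed out until noNewStreams is set, which is exactly why a connection that has silently gone bad continues to receive requests.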

How StreamAllocation uses connections

As the name suggests, StreamAllocation allocates streams:

  1. Create a stream
  2. Assign the stream to a Connection

We focus on how a Connection is obtained. Here is the core logic:

public HttpCodec newStream(
    OkHttpClient client, Interceptor.Chain chain, boolean doExtensiveHealthChecks) {
  int connectTimeout = chain.connectTimeoutMillis();
  int readTimeout = chain.readTimeoutMillis();
  int writeTimeout = chain.writeTimeoutMillis();
  int pingIntervalMillis = client.pingIntervalMillis();
  boolean connectionRetryEnabled = client.retryOnConnectionFailure();

  try {
    RealConnection resultConnection = findHealthyConnection(connectTimeout, readTimeout,
        writeTimeout, pingIntervalMillis, connectionRetryEnabled, doExtensiveHealthChecks);
    HttpCodec resultCodec = resultConnection.newCodec(client, chain, this);

    synchronized (connectionPool) {
      codec = resultCodec;
      return resultCodec;
    }
  } catch (IOException e) {
    throw new RouteException(e);
  }
}


/**
 * Finds a connection and returns it if it is healthy. If it is unhealthy the process is repeated
 * until a healthy connection is found.
 */
private RealConnection findHealthyConnection(int connectTimeout, int readTimeout,
    int writeTimeout, int pingIntervalMillis, boolean connectionRetryEnabled,
    boolean doExtensiveHealthChecks) throws IOException {
  while (true) {
    RealConnection candidate = findConnection(connectTimeout, readTimeout, writeTimeout,
        pingIntervalMillis, connectionRetryEnabled);

    // If this is a brand new connection, we can skip the extensive health checks.
    synchronized (connectionPool) {
      if (candidate.successCount == 0) {
        return candidate;
      }
    }

    // Do a (potentially slow) check to confirm that the pooled connection is still good. If it
    // isn't, take it out of the pool and start again.
    if (!candidate.isHealthy(doExtensiveHealthChecks)) {
      noNewStreams();
      continue;
    }

    return candidate;
  }
}

newStream calls findHealthyConnection to find a healthy, usable connection. findHealthyConnection obtains a candidate RealConnection from findConnection and then checks whether it is usable:

  • If the candidate's successCount is 0, it is a newly created connection and the check is skipped
  • Otherwise a health check is run with candidate.isHealthy(doExtensiveHealthChecks); if it returns false, noNewStreams() is called, which sets the RealConnection's noNewStreams to true

Let's take a closer look at the logic of isHealthy:

/** Returns true if this connection is ready to host new streams. */
public boolean isHealthy(boolean doExtensiveChecks) {
  if (socket.isClosed() || socket.isInputShutdown() || socket.isOutputShutdown()) {
    return false;
  }

  if (http2Connection != null) {
    return !http2Connection.isShutdown();
  }

  if (doExtensiveChecks) {
    try {
      int readTimeout = socket.getSoTimeout();
      try {
        socket.setSoTimeout(1);
        if (source.exhausted()) {
          return false; // Stream is exhausted; socket is closed.
        }
        return true;
      } finally {
        socket.setSoTimeout(readTimeout);
      }
    } catch (SocketTimeoutException ignored) {
      // Read timed out; socket is good.
    } catch (IOException e) {
      return false; // Couldn't read; socket is closed.
    }
  }

  return true;
}

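When the caller passes true, an extensive check of the connection is performed. Note that for an h2 connection the method returns at the http2Connection check, so only the socket state and isShutdown() are consulted in that case.

  • The original read timeout is saved, the read timeout is set to 1 ms, and the stream is tested for exhaustion.
    If a SocketTimeoutException is thrown here, OkHttp treats the connection as usable. However, we saw a large number of SocketTimeout errors in production and could also reproduce frequent timeouts in our local test environment, so we reworked this part.
  • Our A/B experiment shows that the optimized variant performs considerably better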
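
The limits of a 1 ms read probe can be reproduced with plain JDK sockets on the loopback interface. The sketch below is an illustration under stated assumptions, not OkHttp's code: ReadProbeSketch, looksHealthy and demo are hypothetical names, and a raw InputStream.read() stands in for source.exhausted(). It shows that the probe detects a connection the peer has closed, while a peer that simply sends nothing ends in a read timeout and is reported as healthy. A path that silently drops packets also produces only a read timeout, so the probe cannot tell it apart from an idle but working connection.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ReadProbeSketch {
    /** Probe in the style of the extensive check: a timed-out read counts as healthy. */
    static boolean looksHealthy(Socket socket, int probeTimeoutMillis) {
        try {
            int previousTimeout = socket.getSoTimeout();
            try {
                socket.setSoTimeout(probeTimeoutMillis);
                // read() returns -1 once the peer has closed its side.
                // Unlike source.exhausted(), it would consume a byte if one arrived;
                // that is acceptable here because the demo peer never sends data.
                return socket.getInputStream().read() != -1;
            } finally {
                socket.setSoTimeout(previousTimeout);
            }
        } catch (SocketTimeoutException ignored) {
            return true;  // nothing arrived in time: reported as healthy
        } catch (IOException e) {
            return false; // read failed: socket is unusable
        }
    }

    /** Returns {verdict for a silent peer, verdict after the peer closed}. */
    static boolean[] demo() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            boolean silent = looksHealthy(client, 1);    // peer is connected but sends nothing
            peer.close();                                // peer closes its side
            boolean closed = looksHealthy(client, 1000); // generous wait for the close to arrive
            return new boolean[] {silent, closed};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] verdicts = demo();
        System.out.println("silent peer reported healthy: " + verdicts[0]);
        System.out.println("closed peer reported healthy: " + verdicts[1]);
    }
}
```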

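How to handle unusable connections

In the network listener, when a SocketTimeout occurs, set the Connection's noNewStreams to true. A more aggressive option is to use reflection to remove the Connection from the ConnectionPool.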
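
The effect of marking a connection on timeout can be sketched with a small model. Everything here is hypothetical (TimeoutEvictionSketch, Conn, execute, countFailures) and none of it is OkHttp API; the model also assumes that a freshly opened connection works. It compares a client that keeps reusing a pooled connection that can no longer reach the server with one that sets noNewStreams when a SocketTimeoutException is observed.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;

final class TimeoutEvictionSketch {
    /** Hypothetical stand-in for a pooled h2 connection. */
    static final class Conn {
        final boolean reachable; // false models a connection that cannot reach the server
        boolean noNewStreams;

        Conn(boolean reachable) {
            this.reachable = reachable;
        }

        void send() throws SocketTimeoutException {
            if (!reachable) throw new SocketTimeoutException("timeout");
        }
    }

    private final List<Conn> pool = new ArrayList<>();
    private final boolean markOnTimeout;

    TimeoutEvictionSketch(boolean markOnTimeout) {
        this.markOnTimeout = markOnTimeout;
    }

    void seed(Conn c) {
        pool.add(c);
    }

    /** Reuses the first connection still accepting streams, else opens a new one. */
    private Conn acquire() {
        for (Conn c : pool) {
            if (!c.noNewStreams) return c;
        }
        Conn fresh = new Conn(true); // assumption: a new connection works
        pool.add(fresh);
        return fresh;
    }

    /** Executes one request; returns true on success, false on timeout. */
    boolean execute() {
        Conn c = acquire();
        try {
            c.send();
            return true;
        } catch (SocketTimeoutException e) {
            if (markOnTimeout) c.noNewStreams = true; // stop routing requests to it
            return false;
        }
    }

    /** Runs the given number of requests against a pool holding one dead connection. */
    static int countFailures(boolean markOnTimeout, int requests) {
        TimeoutEvictionSketch client = new TimeoutEvictionSketch(markOnTimeout);
        client.seed(new Conn(false));
        int failures = 0;
        for (int i = 0; i < requests; i++) {
            if (!client.execute()) failures++;
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failures without marking: " + countFailures(false, 10));
        System.out.println("failures with marking:    " + countFailures(true, 10));
    }
}
```

In this model the unmarked client fails every request until the dead connection leaves the pool, while the marked client loses only the request that discovered the timeout.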