
Cloud Foundry How Tos - Writing internet-ready clients - Typical problems

Background

Communication over the internet is not reliable, for reasons such as:

  • broken connections
  • temporarily overloaded or unavailable servers
  • slow internet connections

Host not reachable

Problem

This is often caused by a change of the server's IP address that the client does not pick up.

Recommendation

Check the DNS resolution behavior of your runtime environment, e.g. Java or Lua; each may have its own, separate strategy.

The DNS strategy should follow the TTL of the DNS records being served. As a good practice:

  • use a maximum TTL of 60 s
  • retry the DNS resolution if the host is not reachable

For example, Java by default caches resolved DNS entries. For details, refer to Setting the JVM TTL for DNS Name Lookups.

  1. AWS DNS entries typically have a TTL of only 1 minute, so the IP address should be re-checked every minute.
  2. If a connection to a server cannot be established, refresh the DNS entry, because the failure might be caused by a changed IP address.
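In Java, the DNS cache TTL can be set programmatically via the java.security.Security properties (a minimal sketch; the 60 s and 10 s values follow the recommendation above and can be tuned):

```java
import java.security.Security;

public class DnsTtlConfig {
    // Limit the JVM-wide DNS cache so that entries expire after 60 s.
    // This must run before the first lookup; depending on the setup,
    // the JVM may otherwise cache successful lookups much longer.
    public static void configure() {
        Security.setProperty("networkaddress.cache.ttl", "60");
        // Also keep failed lookups only briefly, so a retry can pick up
        // a corrected DNS entry quickly.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
}
```

The same properties can alternatively be set in the JVM's java.security configuration file.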

Timeout while connecting to the server

Problem

Connection timeout (typically 10 s) while connecting to the server.

Recommendation

When a connection timeout occurs, retry the operation immediately; if it fails again, retry after a few seconds.

The following examples illustrate this strategy.
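A minimal sketch of this retry strategy (class and parameter names are illustrative; the operation would typically open the HTTP connection):

```java
import java.util.concurrent.Callable;

public class ConnectRetry {
    // Retry the operation once immediately, then with a fixed delay of a
    // few seconds, up to maxAttempts attempts in total.
    public static <T> T callWithRetry(Callable<T> op, int maxAttempts, long delayMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt == 1) {
                    continue;  // first retry: immediately
                }
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);  // then wait before the next attempt
                }
            }
        }
        throw last;  // all attempts failed: surface the last error
    }
}
```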

Getting a connection reset (or NoHttpResponseException) while accessing the server

Problem

The client might get a connection reset, a NoHttpResponseException, or an SSL connection error while trying to send data to the server.

This can especially happen if:

  • the client uses HTTP keep-alive with a long or unlimited timeout
  • the server closes an idle connection before the client does, either because of a configured idle timeout or because of high load (many connections)
  • the client does not handle this properly. For example, in Java the connection may already be closed while the higher-level code is not aware of it. Reusing the closed connection for the next request then throws an exception; the connection is removed from the connection pool, so the next attempt works.

Recommendation

Recommendations depend on the programming language and the HTTP client implementation.

Example (Java):

https://hc.apache.org/httpclient-3.x/exception-handling.html

In some circumstances, usually when under heavy load, the web server may be able to receive requests but unable to process them. A lack of sufficient resources like worker threads is a good example. This may cause the server to drop the connection to the client without giving any response. HttpClient throws NoHttpResponseException when it encounters such a condition. In most cases it is safe to retry a method that failed with NoHttpResponseException.

https://stackoverflow.com/questions/10558791/apache-httpclient-interim-error-nohttpresponseexception

https://issues.apache.org/jira/browse/HTTPCLIENT-1610

The issue could be resolved using:

import java.io.IOException;

import org.apache.http.NoHttpResponseException;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.protocol.HttpContext;

// Retry requests that fail with NoHttpResponseException, up to a maximum
// number of attempts (MAX_REDIRECT_ATTEMPTS is assumed to be defined elsewhere).
httpBuilder.setRetryHandler(new HttpRequestRetryHandler()
{
    @Override
    public boolean retryRequest(IOException exception, int executionCount, HttpContext context)
    {
        if (executionCount > MAX_REDIRECT_ATTEMPTS)
        {
            // Give up after too many attempts.
            return false;
        }
        // The server dropped the connection without sending a response;
        // such a request can usually be retried safely.
        return exception instanceof NoHttpResponseException;
    }
});

https://hc.apache.org/httpcomponents-client-4.5.x/httpclient/apidocs/org/apache/http/impl/client/StandardHttpRequestRetryHandler.html

StandardHttpRequestRetryHandler is an HttpRequestRetryHandler which assumes that all HTTP methods which should be idempotent according to RFC 2616 are in fact idempotent and can be retried.

According to RFC-2616 section 9.1.2 the idempotent HTTP methods are:
GET, HEAD, PUT, DELETE, OPTIONS and TRACE

Challenge: non-idempotent HTTP methods

While GET, PUT, DELETE etc. can be retried without risk, retrying a POST request might create resources or perform actions twice.

If the POST action is idempotent by nature, the application can simply retry it (like getting a new token via POST /oauth/token).

If not, you may need a more advanced implementation that takes into account whether any data has already been sent, or that inspects the concrete error message.
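One possible shape for such an implementation is a retry decision that combines the RFC 2616 idempotency rules with knowledge of whether the request was ever sent (a sketch; class and method names are illustrative):

```java
import java.util.Set;

public class RetryPolicy {
    // Idempotent methods per RFC 2616, section 9.1.2.
    private static final Set<String> IDEMPOTENT =
            Set.of("GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE");

    // Retry idempotent methods freely. Retry a non-idempotent method
    // (e.g. POST) only when the request body was never fully sent, e.g.
    // the connection failed before the request went out, so the action
    // cannot have been performed twice.
    public static boolean shouldRetry(String method, boolean requestFullySent) {
        if (IDEMPOTENT.contains(method)) {
            return true;
        }
        return !requestFullySent;
    }
}
```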

Timeout (http code 504) while accessing the server

Problem

The action times out, typically due to long processing, a slow connection, or a server under high load.

Recommendation

Retry with an appropriate strategy. Here, retrying after some time can give the server the chance to get up to speed again.

Non-idempotent actions will be a challenge.

Bad gateway (http code 502) while accessing the server

Problem

This can have many causes; the communication between the gateway and the upstream (backend) server failed.

Recommendation

Retry with an appropriate strategy. Here, retrying after some time can give the server the chance to get up to a healthy state again.

Non-idempotent actions will be a challenge.

Internal server error (http code 500) while accessing the server

Problem

This may have many reasons, like out of memory on the gateway or the upstream server.

Recommendation

Retry with an appropriate strategy. Here, retrying after some time can give the server the chance to get up to a healthy state again.

Non-idempotent actions will be a challenge.

However, multiple retries might cause a 400 error.

Service unavailable (http code 503) while accessing the server

Problem

This occurs if the server is temporarily overloaded or not available (just restarting).

Recommendation

Retry with an appropriate strategy. Here, retrying after some time can give the server the chance to get up to a healthy state again.

Non-idempotent actions are not a problem here, because a 503 means the server rejected the request before processing it.

All 4XX errors while accessing the server

Problem

These errors are specific to the request and are typically caused by a missing or expired token, missing access rights to the resource, or bad content/format of the request.

Recommendation

Here, instead of retrying, address the cause; for example, retrieve a new token if the old one has expired.

General recommendations

  1. Retry immediately (for the cases above, where applicable), possibly more than once.
  2. Then switch to an exponential backoff retry strategy, see for example https://docs.aws.amazon.com/general/latest/gr/api-retries.html.
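A sketch of exponential backoff with full jitter in the spirit of the AWS guidance linked above (names and values are illustrative):

```java
import java.util.concurrent.ThreadLocalRandom;

public class Backoff {
    // Delay grows as baseMillis * 2^attempt, capped at capMillis, with a
    // random ("full jitter") factor so that many clients do not retry in
    // lockstep after an outage.
    public static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long exp = Math.min(capMillis, baseMillis * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(exp + 1);  // 0..exp inclusive
    }
}
```

A caller would sleep for delayMillis(attempt, 100, 5000) before attempt n, giving delays up to 100 ms, 200 ms, 400 ms, ..., capped at 5 s.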

The time frame

Typical server errors last 2-5 minutes if the system restarts or scales out automatically, up to 30 minutes if manual action is required, or 12-24 hours if, for example, the internet connection of your agent is down.

The following sections give concrete suggestions.

Retries in synchronous calls like API implementations or UIs

As API calls have a timeout (typically 60 s) and the user cannot wait for a long time, stop retrying after a reasonable time (20-30 seconds) and return a meaningful error code or message to the user.
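This can be sketched as a deadline-bounded retry loop (illustrative names; the deadline would be 20-30 s in practice):

```java
import java.util.concurrent.Callable;

public class DeadlineRetry {
    // Keep retrying with a short pause, but give up once an overall
    // deadline is reached, so the user gets a proper error instead of
    // waiting indefinitely.
    public static <T> T callUntilDeadline(Callable<T> op, long deadlineMillis,
                                          long pauseMillis) throws Exception {
        long end = System.currentTimeMillis() + deadlineMillis;
        while (true) {
            try {
                return op.call();
            } catch (Exception e) {
                if (System.currentTimeMillis() + pauseMillis >= end) {
                    throw e;  // deadline reached: surface the error to the caller
                }
                Thread.sleep(pauseMillis);
            }
        }
    }
}
```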

Retries in asynchronous calls like listening to a queue

As long as you do not want to lose data, retry wherever applicable until the queue fills up.
If only some messages cannot be delivered, put them on a dead letter queue after some time (10 s, 2 min) and retry independently from there. This ensures that bad messages do not block your queue processing.
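The dead-letter handling can be sketched as follows (illustrative; a real implementation would use the message broker's DLQ support rather than an in-memory deque):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

public class DlqProcessor {
    private final Deque<String> deadLetters = new ArrayDeque<>();

    // Try to deliver each message up to maxAttempts times; after that,
    // move it to the dead letter queue so one bad message does not block
    // the rest of the queue. Dead letters are retried independently later.
    public void process(Deque<String> queue, Predicate<String> deliver, int maxAttempts) {
        while (!queue.isEmpty()) {
            String msg = queue.poll();
            boolean delivered = false;
            for (int i = 0; i < maxAttempts && !delivered; i++) {
                delivered = deliver.test(msg);
            }
            if (!delivered) {
                deadLetters.add(msg);
            }
        }
    }

    public Deque<String> deadLetters() { return deadLetters; }
}
```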

Retries in agent implementations

  1. Getting a token:
    Keep retrying wherever applicable, but use an exponential backoff retry strategy.
  2. Sending data:
    Depending on the use case, buffer your data and try to upload it again. Here, the buffer size limits the time that can be bridged, i.e. at some point data will be dropped. You may also have different classes of data.
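The buffering described above can be sketched as a bounded buffer that drops the oldest data when full (illustrative names):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class AgentBuffer {
    private final Deque<String> buffer = new ArrayDeque<>();
    private final int capacity;

    public AgentBuffer(int capacity) { this.capacity = capacity; }

    // Buffer data while uploads fail; when the buffer is full, drop the
    // oldest entry so memory stays bounded. The capacity thereby limits
    // how long an outage can be bridged without losing data.
    public void add(String sample) {
        if (buffer.size() == capacity) {
            buffer.removeFirst();  // drop oldest
        }
        buffer.addLast(sample);
    }

    public int size() { return buffer.size(); }
    public String oldest() { return buffer.peekFirst(); }
}
```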

Last update: July 6, 2023

Except where otherwise noted, content on this site is licensed under the Development License Agreement.