Skip to content

Interface contract for robust error handling? #11

@didibus

Description

@didibus

In order for a user to implement a production robust use of the library APIs, the error scenarios need to be well documented, and ideally they should all follow well defined patterns so the user has a simple way to appropriately handle all error scenarios.

For example, a use case I've tried to implement is to fetch thousands of URLs, attempting for all of them to succeed.

The starting code looks like:

(ns script
  (:require [babashka.http-client :as http]))
(->> ["https://clojure.org"
      "https://google.com"
      "https://amazon.com"
      "https://localhost"
      "garbage"]
     (mapv #(http/get % {:async true}))
     (mapv deref)
     (println))

Where the vector of URL strings would contain thousands more URLs in it.

In a real life scenario, we can't assume we have the cleanest data, first of all that means some URLs might be malformed to begin with, such as "garbage" in my example.

In our case, http/get will not behave in an async way even though we said :async true, because it's URL to URI parser is actually ran synchronously and throws synchronously on error, instead of always returning a CompletableFuture.

This means users must know that some errors are synchronous, while others are async. We might update the code as follows:

(->> ["https://clojure.org"
      "https://google.com"
      "https://amazon.com"
      "https://localhost"
      "garbage"]
     (mapv #(try (http/get % {:async true})
                 (catch Exception e (delay e))))
     (mapv deref)
     (println))

This delays the synchronous errors, so that they show up asynchronously later when we deref the results, allowing us to move all the error handling around the deref.

Next the user needs to know how to handle errors in the deref part. Because we want this to be a robust implementation, the user wants to perform a retry on all errors that indicate a transient issue, but would like to log and continue on all errors that indicate an issue with the input or a bug, for which retrying would be fruitless, and finally after some number of retries if the retryable ones still fail, it wants to similarly log and continue.

The first code refactor might look like:

(->> ["https://clojure.org"
      "https://google.com"
      "https://amazon.com"
      "https://localhost"
      "garbage"]
     (mapv #(try (http/get % {:async true})
                 (catch Exception e (delay e))))
     (mapv #(try (deref %)
                 (catch Exception e e)))
     (println))

Here we're turning the deref's thrown exceptions into return values, so we can process them after to check for which one succeeded or errored.

Now we refactor the code again so we can handle errors:

(defn error-type
  [result]
  (cond (instance? clojure.lang.ExceptionInfo result)
        :retryable-error
        (instance? Exception result)
        :non-retryable-error
        (and (map? result) (not= 200 (:status result)))
        :retryable-error
        :else
        :success))

(defn batch-fetch
  [urls]
  (->> urls
       (mapv #(try (http/get % {:async true})
                   (catch Exception e (delay e))))
       (mapv #(try (deref %)
                   (catch Exception e e)))
       (group-by error-type)
       (doall)))

(batch-fetch ["https://clojure.org"
              "https://google.com"
              "https://amazon.com"
              "https://localhost"
              "garbage"])

This is the crux of the issue, look at the error-type function. Here we want a reliable way to identify everything that's not a success, and for each thing that's not a success, we want to identify if it's retryable or not retryable. But because babashka.http-client doesn't provide well defined errors, it is very difficult to know how to identify the type of errors that could happen exhaustively.

Currently, it appears sometimes the error will be an ex-info but not always. If it is an ex-info, it's not clear how to tell what type of error it is, or if there will always be a :status key on the ex-data for it.

The user would need documented descriptions of where errors are thrown/returned, the set of all possible type of errors at each point, and their structure.

At a minimum, a consistent structure for all errors would help a lot, for example making all errors ex-info with the type of error documented on a :type key on the ex-data, along with a documented list of the set of all possible types of errors, and that all errors will always have an ex-data with the details.

That way inside error-type we could simply inspect the ex-data :type key and if it's one of a retryable type, we can classify it as retryable, otherwise non-retryable.

This would also allow us, for non-retryable ones, to know what to log in the error, since we can always expect an ex-data with some common structure, and a ex-message, we can log something like:

(log (str (:type (ex-data err)) ", " (ex-message err) ", " (ex-data err)))

Finally, it would be nice if all errors contained the URL as well which it failed on, because for the retryable ones for example, the user would want to retry, but if the error no longer contains the information to map back to which of the async get this was for, knowing which one to retry is non-trivial. Thus either having the URL as part of the error so we can retry using the error data itself, or having a feature to provide some ID to the async get call which is included on the error so we can map it back would be ideal.

Regards.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions