Skip to content

significant degredation of basilisp bootstrap start up time with large syspaths #1135

Closed
@ikappaki

Description

@ikappaki

Hi,

there is a significant degradation in Basilisp bootstrap performance when sys/paths is large.

This is because basilisp.core attempts to load data readers via (load-data-readers) from each directory in sys/path, that leads to a full directory traversal via file-seq for each path, which can be very slow.

(defn- data-readers-files []
  (->> sys/path
       (mapcat file-seq) ;; ====> FULL TRAVERSAL OF EACH path
       (filter (comp #{"data_readers.lpy" "data_readers.cljc"} name))
       (group-by #(.-parent %))
       vals
       ;; Only load one data readers file per directory and prefer
       ;; `data_readers.lpy` to `data_readers.cljc`
       (map #(first (sort-by name > %)))))

(defn- load-data-readers []
  (alter-var-root
   #'*data-readers*
   (fn [mappings additional-mappings]
     (reduce (fn [m [k v]]
               (if (not= (get m k v) v)
                 (throw (ex-info "Conflicting data-reader mapping"
                                 (merge (meta k) {:conflict k, :mappings m})))
                 (assoc m k v)))
             mappings
             additional-mappings))
   ;; Can't use `read` when altering `*data-readers*` so do reads ahead of time
   (->> (concat (data-readers-files)
                (data-readers-entry-points))
        (mapcat #(make-custom-data-readers % nil))
        doall)))

(load-data-readers)

In large python environment with tens of packages in the sys/path, this approach can results in several seconds of startup time, which severely impacts applicability.

The Clojure documentation for *data-readers* appears to suggests that it should only look for data reader files at the root of each classpath directory:

(def ^{:added "1.4" :dynamic true} *data-readers*
  "Map from reader tag symbols to data reader Vars.

  When Clojure starts, it searches for files named 'data_readers.clj'
  and 'data_readers.cljc' at the root of the classpath. Each such file
  must contain a literal map of symbols, like this:

However, the tests/basilisp/test_data_readers.lpy file contains test cases that look into subdirectories of the sys/path to locate the data reader files, which makes my earlier interpretation ambiguous whether it should traverse the entire sys/path or only the root of each directory.

A quick look at the Clojure code for the same suggests that the getResources() method of the class loader is being used to find the files, which is not clear what it does in this respect.

https://github.com/clojure/clojure/blob/ce55092f2b2f5481d25cff6205470c1335760ef6/src/clj/clojure/core.clj#L7908-L7912

(defn- data-reader-urls []
  (let [cl (.. Thread currentThread getContextClassLoader)]
    (concat
      (enumeration-seq (.getResources cl "data_readers.clj"))
      (enumeration-seq (.getResources cl "data_readers.cljc")))))

Nevertheless, traversing every directory and subdirectory is costly, especially with large sys/paths. This result in significant delays during startup, which might not be acceptable for environments where start up speed is critical, not withstanding the user experience.

Possible solution I can think of.

  1. Leave it as is and accept the performance cost. This is not ideal since it doesn't scale well with large sys/path environments.
  2. Limit to the root of each directory. This will reduce the cost significantly and has a good scalability, but will break libs that store their data reader files deeper in the source tree.
  3. Make the data readers loading optional. Provide a configuration to disable data reader loading (similar to the compiler options?). Some libraries might not function properly if they provide data readers and this option is turned off.

Any other suggestions?

Thanks

I've created a patch that limits the search to the root of each directory in the sys/path, only to realize later there were a few test cases explicitly expecting full dir traversals

(defn- data-readers-files []
  (->> sys/path
       (map #(if (= % "") "." %))
       (mapcat (fn [dir] (when (os.path/isdir dir)
                           (->> (os/listdir dir)
                                (map #(pathlib/Path (os.path/join dir %)))
                                (filter #(os.path/isfile %))))))
       (filter (comp #{"data_readers.lpy" "data_readers.cljc"} name))
       (group-by #(.-parent %))
       vals
       ;; Only load one data readers file per directory and prefer
       ;; `data_readers.lpy` to `data_readers.cljc`
       (map #(first (sort-by name > %)))))

cc'ing @mitch-kyle whoI believe is the author of the current functionality.

Metadata

Metadata

Assignees

No one assigned

    Labels

    component:basilisp.coreIssue pertaining to basilisp.core namespaceissue-type:bugSomething isn't workingperformanceIssue pertaining to performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions