Replicating Datomic/Datalog queries with core.logic

2012-07-16 clojure core.logic datomic

I’ve been toying with Datomic recently, and I particularly like the power of its query language (~Datalog). Mr. Halloway showed a couple of months ago how the query engine is generic enough to be run on standard Clojure collections, gist here. Here is an example from that page of a simple join;

(q '[:find ?first ?height
     :in [[?last ?first ?email]] [[?email ?height]]]
   [["Doe" "John" "jdoe@example.com"]
    ["Doe" "Jane" "jane@example.com"]]
   [["jane@example.com" 73]
    ["jdoe@example.com" 71]])
;; #<HashSet [["Jane" 73], ["John" 71]]>

A question you might ask yourself is how can you use core.logic to do the same kind of queries? It turns out that it’s pretty straightforward, and also very fast. Core.logic provides some convenient helper functions for unification that we are going to use. Here’s an example of how to get a binding map for some logical variables over a collection;

(binding-map '(?first ?last) ["John" "Doe"])
;; {?last "Doe", ?first "John"}

Let’s write a little helper function that maps binding-map over all elements of a seq (of tuples) (I’m using binding-map* so I only need to prep the rule once)

(defn query [rule xs]
  (let [prule (prep rule)]
    (map #(binding-map* prule (prep %)) xs)))

(query '(?answer) (repeatedly 5 #(vector 42)))
;; ({?answer 42} {?answer 42} {?answer 42} {?answer 42} {?answer 42})

Note that I also pick out just the first/height lvars here, which corresponds to the :find clause in the Datomic query. That’s it really - not as generic (and data-driven) as the Datomic query, but working;

(defn join-test [xs ys]
  (let [rx (query '(?last ?first ?email) xs)
        ry (query '(?email ?height) ys)
        r (clojure.set/join rx ry)]
    (map (juxt '?first '?height) r)))

Note that I also pick out just the first/height lvars here, this corresponds to the :find clause in the datomic query. That’s it really, not as generic (and data driven) as the datomic query, but working;

(join-test
 [["Doe" "John" "jdoe@example.com"]
  ["Doe" "Jane" "jane@example.com"]]
 [["jane@example.com" 73]
  ["jdoe@example.com" 71]])
;; (["John" 71] ["Jane" 73])

Here’s the kicker: for this join query, the datomic.api/q’s time complexity estimates to O(n²) (actually 22n²-50n), whereas the time complexity for core.logic/unify + clojure.set/join solution is O(n) (10n). That means that for a run over a modest dataset of size 5000, the difference is 50x!

Edit: The Datomic query used in the benchmark is not optimal, as it turns out. In the example below, you’ll see a more optimal version that’s in fact ~18x faster than the core.logic + clojure.set/join solution.

(defn bench [n f]
  (let [rand-str #(str (java.util.UUID/randomUUID))
        emails (repeatedly n rand-str)
        name-email (reduce (fn [res em]
                             (conj res (vector (rand-str) (rand-str) em)))
                           [] emails)
        email-height (reduce (fn [res em]
                               (conj res (vector em (rand-int 100))))
                             [] emails)]
    (time (count (f name-email email-height)))))


(bench 5000 (partial q '[:find ?first ?height
                         :in [[?last ?first ?email]] [[?email ?height]]]))
;; "Elapsed time: 14757.248824 msecs"
;; 5000

(bench 5000 join-test)
;; "Elapsed time: 185.604 msecs"
;; 5000

;; Optimised Datomic query
(bench 5000 (partial q '[:find ?first ?height
                         :in $a $b
                         :where [$a ?last ?first ?email] [$b ?email ?height]]))
;; "Elapsed time: 10.869 msecs"
;; 5000

Obviously this little example doesn’t convey the true power of either datomic/datalog or core.logic/unifier. Be careful writing your Datomic queries, the running time can be vastly different!

Here some more of the datomic queries converted in a similar fashion.