Table 2 Unbiased estimators for the Jaccard similarity, the containment index, the second Kulczynski index, and the Whittaker distance, when using FracMinHash sketches instead of the original sets

From: Estimating similarity and distance using FracMinHash

| Metric name | Expression | Unbiased estimator |
| --- | --- | --- |
| Jaccard similarity | \(J(A,B) = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}\) | \(\hat{J}(A,B) = J\Big ( \textbf{FRAC}_{s}(A), \textbf{FRAC}_{s}(B) \Big ) \times \frac{1}{ 1 - (1-s)^{\lvert A \cup B \rvert} }\) |
| Containment index | \(C(A,B) = \frac{\lvert A \cap B \rvert}{\lvert A \rvert}\) | \(\hat{C}(A,B) = C\Big ( \textbf{FRAC}_{s}(A), \textbf{FRAC}_{s}(B) \Big ) \times \frac{1}{ 1 - (1-s)^{\lvert A \rvert} }\) |
| Kulczynski 2 | \(K_2(A,B) = \frac{1}{2} \Big ( \frac{\lvert A \cap B \rvert}{\lvert A \rvert} + \frac{\lvert A \cap B \rvert}{\lvert B \rvert} \Big )\) | \(\hat{K_2}(A,B) = \frac{1}{2}\Big (\hat{C}(A,B) + \hat{C}(B,A)\Big )\) |
| Whittaker distance | \(W(A,B) = 1 - \frac{1}{2} \Big ( \frac{\lvert A \cap B \rvert}{\lvert A \rvert} + \frac{\lvert A \cap B \rvert}{\lvert B \rvert} \Big )\) | \(\hat{W}(A,B) = 1 - \hat{K_2}(A,B)\) |
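
The four estimators in Table 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions that the table itself does not state: the sketch \(\textbf{FRAC}_{s}(A)\) is taken to be the set of hash values of elements of \(A\) that fall below a fraction \(s\) of the hash range; the hash is the first 8 bytes of SHA-256; an empty sketch yields 0; and the true sizes \(|A|\), \(|B|\), and \(|A \cup B|\) needed by the correction factors are passed in by the caller. All function and parameter names (`frac_sketch`, `jaccard_hat`, `union_size`, and so on) are hypothetical.

```python
import hashlib

MAX_HASH = 2**64  # assumption: hash values lie in [0, 2^64)


def _hash64(item: str) -> int:
    """Deterministic 64-bit hash (assumption: first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def frac_sketch(items, s: float) -> set:
    """Keep the hash values that fall below the fraction s of the hash range."""
    if not 0.0 < s <= 1.0:
        raise ValueError("scale factor s must be in (0, 1]")
    threshold = s * MAX_HASH
    return {h for h in map(_hash64, items) if h < threshold}


def _correction(s: float, size: int) -> float:
    """The factor 1 / (1 - (1 - s)^size) from Table 2."""
    if size < 1:
        raise ValueError("set size must be at least 1")
    return 1.0 / (1.0 - (1.0 - s) ** size)


def jaccard_hat(sk_a: set, sk_b: set, s: float, union_size: int) -> float:
    """Sketch Jaccard times the correction factor; union_size is |A ∪ B|."""
    union = sk_a | sk_b
    if not union:  # assumption: empty sketches give 0
        return 0.0
    return len(sk_a & sk_b) / len(union) * _correction(s, union_size)


def containment_hat(sk_a: set, sk_b: set, s: float, size_a: int) -> float:
    """Sketch containment of A in B times the correction factor; size_a is |A|."""
    if not sk_a:  # assumption: an empty sketch of A gives 0
        return 0.0
    return len(sk_a & sk_b) / len(sk_a) * _correction(s, size_a)


def kulczynski2_hat(sk_a, sk_b, s, size_a, size_b) -> float:
    """Mean of the two containment estimators."""
    return 0.5 * (containment_hat(sk_a, sk_b, s, size_a)
                  + containment_hat(sk_b, sk_a, s, size_b))


def whittaker_hat(sk_a, sk_b, s, size_a, size_b) -> float:
    """One minus the Kulczynski 2 estimator."""
    return 1.0 - kulczynski2_hat(sk_a, sk_b, s, size_a, size_b)


if __name__ == "__main__":
    # Toy data: two overlapping sets of made-up string "k-mers".
    a = {f"kmer{i}" for i in range(0, 2000)}
    b = {f"kmer{i}" for i in range(1000, 3000)}
    s = 0.1
    sk_a, sk_b = frac_sketch(a, s), frac_sketch(b, s)
    print("J_hat  :", jaccard_hat(sk_a, sk_b, s, len(a | b)))
    print("C_hat  :", containment_hat(sk_a, sk_b, s, len(a)))
    print("K2_hat :", kulczynski2_hat(sk_a, sk_b, s, len(a), len(b)))
    print("W_hat  :", whittaker_hat(sk_a, sk_b, s, len(a), len(b)))
```

With \(s = 1\) every element is retained and each correction factor equals 1, so the functions return the exact values of \(J\), \(C\), \(K_2\), and \(W\); this makes a convenient sanity check. In practice the true set sizes are usually unknown and would themselves have to be estimated, which this sketch does not attempt.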