现充|junyu33

Rathee's trilogy — CrypTFlow2, SIRNN, and SecFloat

Background

Notation: let $\lambda$ be the security parameter, and let the secret-shared integers be $l$ bits long.

We assume basic familiarity with multi-party computation (MPC).

For floating-point storage, we use IEEE 754 Float32 (the single-precision format): 1 sign bit, 8 exponent bits, and 23 mantissa bits.

IEEE 754 Addition

Without loss of generality, assume $|x| \ge |y|$. To compute $z = x + y$, where:

$$x=(-1)^{S_x}(1.M_x)\cdot 2^{E_x},\qquad y=(-1)^{S_y}(1.M_y)\cdot 2^{E_y}$$

The steps are as follows:

  1. Exponent alignment. If $E_x > E_y$, right-shift the mantissa of $y$: $M_y \leftarrow 1.M_y \cdot 2^{E_y-E_x}$, $E \leftarrow E_x$, $M_x \leftarrow (1.M_x)$.
  2. Mantissa addition. Consider the signs $S_x$ and $S_y$. If they are the same, compute $M = M_x + M_y$. If they differ, subtract the smaller from the larger to get the absolute difference $M = |M_x - M_y|$, and attach the correct sign $S$.
  3. Normalization. Clearly $\max\{M_x+M_y,\,|M_x-M_y|\} < 4$, so $M$ needs to be right-shifted at most once (corresponding to adding 1 to the exponent $E$). If $M = 0$, skip directly to step 4. Otherwise, repeatedly left-shift $M$ and decrement $E$ until $M \in [1, 2)$.
  4. Write the result. Finally, remove the leading 1 of $M$, and use the final $S, E, M$ as the new 32-bit floating-point result.
  5. Rounding issues may arise in four places: $M_y \leftarrow 1.M_y \cdot 2^{E_y-E_x}$, $M = M_x + M_y$, $M = |M_x - M_y|$, and the one-bit right shift of $M$.
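The four steps above can be sketched with integer arithmetic in Python. This is a simplified sketch under two assumptions not in the text: both operands are positive, and bits shifted out during alignment are simply dropped instead of rounded (rounding is covered below).

```python
import struct

def decompose(x):
    # Re-encode a Python float as IEEE 754 single precision and split it into
    # sign bit, unbiased exponent, and 24-bit mantissa (implicit leading 1 restored).
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    s = bits >> 31
    e = ((bits >> 23) & 0xFF) - 127
    m = (bits & 0x7FFFFF) | (1 << 23)
    return s, e, m

def fp_add(x, y):
    # Steps 1-4 for two positive floats; the shifted-out bits are dropped,
    # so the result is exact only when no rounding would be needed.
    sx, ex, mx = decompose(x)
    sy, ey, my = decompose(y)
    assert sx == 0 and sy == 0, "sketch handles positive operands only"
    if ex < ey:                      # ensure x has the larger exponent
        ex, mx, ey, my = ey, my, ex, mx
    my >>= ex - ey                   # 1. exponent alignment
    m, e = mx + my, ex               # 2. mantissa addition (same sign)
    if m >= 1 << 24:                 # 3. normalization: at most one right shift
        m >>= 1
        e += 1
    return m * 2.0 ** (e - 23)       # 4. reassemble the value
```

For inputs whose sum is exactly representable, this matches native addition, e.g. `fp_add(1.5, 2.25)` gives `3.75`.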

IEEE 754 Multiplication

Multiplication is relatively simple because it doesn't involve comparing the magnitudes of numbers.

  1. XOR the sign bits. S=SxSy.
  2. Add the exponents. E=Ex+Eybias.
  3. Multiply the mantissas. M=(1.Mx)(1.My), the result is in the range [1,4).
  4. Normalize. If the result 2, increment E and right-shift M by one bit, then remove the most significant bit of M.
  5. Rounding operations are involved in the steps of M=(1.Mx)(1.My) and right-shifting M by one bit.

Rounding Method

Let $d, g, f$ denote the least significant kept bit (decision bit), the most significant discarded bit (guard bit), and the OR of the remaining discarded bits (sticky bit) of $M$, respectively. The "round half to even" rule can be formally expressed as:

$$c = g \wedge (d \vee f)$$

Simply put, a carry ($c=1$) is only possible when $g=1$. In that case, if $f=1$ (the discarded part is strictly greater than half a unit) or $d=1$ (exactly half, and the result must be rounded to even), then a carry occurs.

Note that normalization may be required after rounding.
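The rule can be checked directly on integers. A minimal sketch of a round-to-nearest-even right shift (the function name `rnte_shift` is mine, not the paper's):

```python
def rnte_shift(x, r):
    # Round-to-nearest-even right shift of a non-negative integer x by r bits,
    # implementing c = g AND (d OR f).
    d = (x >> r) & 1                        # decision bit: LSB of the kept part
    g = (x >> (r - 1)) & 1                  # guard bit: MSB of the discarded part
    f = int(x & ((1 << (r - 1)) - 1) != 0)  # sticky bit: OR of the remaining bits
    return (x >> r) + (g & (d | f))
```

For example, 11/4 = 2.75 rounds up to 3, 10/4 = 2.5 ties to the even value 2, and 14/4 = 3.5 ties to the even value 4.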

Building Blocks

Below, we will explain how the entire protocol is constructed using a bottom-up approach.

FMUX

The ideal functionality we need to implement is $\mathrm{MUX}(b, x) = (b == 1\ ?\ x : 0)$. That is, for a Boolean secret sharing $[c] = (c_0, c_1)$ with $c_i \in \{0,1\}$ and an arithmetic secret sharing $[a] = (a_0, a_1)$ with $a \in \mathbb{Z}_n$, the output should be $[a \cdot c]$. The steps are as follows:

SETUP: P0 holds a0,c0, and P1 holds a1,c1.

Each party masks its OT messages with a random value ($r_0$ for P0, $r_1$ for P1):

$$(s_0,s_1)=(r_0+c_0a_0,\; r_0+a_0(1-c_0)),\qquad (t_0,t_1)=(r_1+c_1a_1,\; r_1+a_1(1-c_1))$$

Via OT, each party then selects from the other's pair with its own choice bit:

$$x_0=t_{c_0},\qquad x_1=s_{c_1}$$

We list the four possible cases for $(c_0, c_1)$:

| $c_0$ | $c_1$ | $x_0$ | $x_1$ | $(x_0+x_1)-(r_0+r_1)$ |
|---|---|---|---|---|
| 0 | 0 | $r_1+c_1a_1$ | $r_0+c_0a_0$ | $0$ |
| 0 | 1 | $r_1+c_1a_1$ | $r_0+a_0(1-c_0)$ | $a$ |
| 1 | 0 | $r_1+a_1(1-c_1)$ | $r_0+c_0a_0$ | $a$ |
| 1 | 1 | $r_1+a_1(1-c_1)$ | $r_0+a_0(1-c_0)$ | $0$ |

In every case $(x_0+x_1)-(r_0+r_1) = a \cdot c$, so each $P_b$ outputs $x_b - r_b$ as its share of $[a \cdot c]$; the protocol is correct.
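As a sanity check, the four cases can be simulated in the clear in Python (the OT is modeled by simply letting each party index the other's message pair; `fmux_sim` is an illustrative name):

```python
import random

def fmux_sim(a, c, n=2**32):
    # Simulate the FMUX construction in the clear.
    a0 = random.randrange(n); a1 = (a - a0) % n   # arithmetic shares of a
    c0 = random.randrange(2); c1 = c ^ c0         # Boolean shares of c
    r0, r1 = random.randrange(n), random.randrange(n)
    s = ((r0 + c0 * a0) % n, (r0 + a0 * (1 - c0)) % n)  # P0's OT messages
    t = ((r1 + c1 * a1) % n, (r1 + a1 * (1 - c1)) % n)  # P1's OT messages
    x0, x1 = t[c0], s[c1]                         # each party's OT output
    out0, out1 = (x0 - r0) % n, (x1 - r1) % n     # output shares
    return (out0 + out1) % n                      # reconstructs a * c
```

Running it for both values of $c$ over fresh random shares always reconstructs $a \cdot c$.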

The communication cost is that of two IKNP-OTs, i.e. $2(\lambda+2l) = 2\lambda+4l$. However, the optimization in Section 3.1.1 of CrypTFlow2 reduces this to $2\lambda+2l$.

FAND

This is a direct implementation with Beaver triples; the communication overhead is $(\lambda+16)+4 = \lambda+20$. (See Appendix A.1 of CrypTFlow2.)

FOR

We have $[x \vee y] = [x \oplus y] \oplus [x \wedge y]$. Therefore, letting $[x \wedge y] = (z_0, z_1)$, each party can locally compute $x_i \oplus y_i \oplus z_i$. The communication cost is the same as for FAND, which is $\lambda+20$.

FEQ

We divide $x$ and $y$ into $m$-bit blocks. To compare a pair of blocks $x_j, y_j$, we use 1-out-of-$2^m$ Oblivious Transfer (OT).

We can then extend the $m$-bit equality to $2m$ bits using a tree structure of FAND gates, calculating $[eq_{1,j}] = [eq_{0,j}] \wedge [eq_{0,j+m}]$, and so on until the comparison covers the entire length. The communication complexity is $\frac{l}{m}(2\lambda+2^m)+\frac{l}{m}(\lambda+20)$, and the number of rounds is $\log l$.

FGT/LT

The approach still involves block processing. First, we compute $1\{x_j < y_j\}$ for blocks of length $m$, which can be done using 1-out-of-$2^m$ OT: P0 only needs to randomly generate $(lt_{0,j})_0$ and prepare the $2^m$ messages $t_{j,k} = (lt_{0,j})_0 \oplus 1\{x_j < k\}$.

Then, during the merging process, the higher-order bits take priority: $1\{x<y\} = 1\{x_H<y_H\} \oplus (1\{x_H=y_H\} \wedge 1\{x_L<y_L\})$.

The communication cost is less than $\lambda(4q)+2^m(2q)+22q$. With $m=4$ and $q=l/4$, this becomes $\lambda l+13.5l$, and the number of rounds is $\log l$.
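The blockwise merge rule can be checked in plain Python (a sketch of the plaintext logic only, with no secret sharing; names are mine):

```python
def lt_blocks(x, y, l=32, m=4):
    # Compare x < y by merging per-block results bottom-up:
    # 1{x<y} = 1{xH<yH} XOR (1{xH=yH} AND 1{xL<yL})
    mask = (1 << m) - 1
    lt = [int(((x >> i*m) & mask) < ((y >> i*m) & mask)) for i in range(l // m)]
    eq = [int(((x >> i*m) & mask) == ((y >> i*m) & mask)) for i in range(l // m)]
    while len(lt) > 1:  # blocks are ordered low to high; merge pairs (low, high)
        lt = [lt[i+1] ^ (eq[i+1] & lt[i]) for i in range(0, len(lt), 2)]
        eq = [eq[i] & eq[i+1] for i in range(0, len(eq), 2)]
    return lt[0]
```

The tree has $\log(l/m)$ merge levels, matching the round count of the protocol.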

FLUT

Assume the LUT has 2m entries, each with n bits.

SETUP: P0 randomly selects an index $r \in \{0,1\}^m$ and a share $T_0[i] \in \mathbb{Z}_{2^n}$ for each entry $i \in \{0,1\}^m$; $T_0$ is a masked version of the LUT $L$.

FWrap

If we want to compute $1\{a+b > 2^n-1\}$, this is equivalent to computing $1\{2^n-1-a < b\}$. Therefore, we can directly use FGT/LT.

FB2A

P0 and P1 each hold a Boolean share of a single bit $c = c_0 \oplus c_1$, where $c \in \{0,1\}$, and ultimately obtain $d = d_0 + d_1 \pmod{2^n}$ such that $d = c$.

Let's verify the result: $d_0 + d_1 = c_0 + c_1 - 2(y_0+y_1) = c_0 + c_1 - 2c_0c_1 = c_0 \oplus c_1 = c$, where $y_0 + y_1 = c_0 c_1$ is the cross term obtained via COT.

The communication cost is one 1-out-of-2 COTn, with a cost of λ+n.
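The arithmetic identity behind FB2A can be verified in the clear (the COT is modeled by randomly splitting the cross term; `b2a_sim` is an illustrative name):

```python
import random

def b2a_sim(c0, c1, n=2**16):
    # d_b = c_b - 2*y_b, where y0 + y1 = c0*c1 would come from one COT;
    # here the cross term is just split randomly in the clear.
    y0 = random.randrange(n)
    y1 = (c0 * c1 - y0) % n
    d0 = (c0 - 2 * y0) % n
    d1 = (c1 - 2 * y1) % n
    return (d0 + d1) % n   # equals c0 XOR c1
```

All four $(c_0, c_1)$ combinations reconstruct the XOR.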

FZExt

Having already defined FWrap and FB2A, constructing FZExt is a natural step. For $m$-bit additive sharing, we attempt to zero-extend it to $n$ bits ($n > m$). First, we check whether there is a carry between the two shares (using FWrap), and then convert the resulting Boolean carry $w$ into an arithmetic value in $\mathbb{Z}_{2^{n-m}}$ using FB2A.

However, since we are only performing a zero extension, the reconstructed value must not keep the carry at the $m$-th bit. Therefore, both parties subtract their share of $w \cdot 2^m$ in $\mathbb{Z}_{2^n}$. Since $m$ is public, multiplying the share of $w$ by $2^m$ can be done locally.

The cost is Comm(FWrap + FB2A) = $(\lambda m+14m)+(\lambda+(n-m)) = \lambda(m+1)+13m+n$.
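The carry-cancellation step can be checked numerically (a plaintext sketch; in the real protocol $w$ is secret-shared via FB2A before the subtraction):

```python
import random

def zext_sim(x0, x1, m=8, n=16):
    # Zero-extend m-bit shares to n bits: subtracting w * 2^m (w from FWrap)
    # removes the spurious carry between the two shares.
    w = int(x0 + x1 >= 2**m)        # FWrap(x0, x1, m)
    y0 = (x0 - w * 2**m) % 2**n     # w would be secret-shared via FB2A
    y1 = x1 % 2**n
    return (y0 + y1) % 2**n         # equals (x0 + x1) mod 2^m
```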

FTR

Since we have FZExt for extending from smaller to larger bit lengths, we naturally also have the reverse operation, FTR. We assume that we truncate the lower $s$ bits from an $l$-bit number and output the remaining high $l-s$ bits (essentially equivalent to x >> s):

SETUP: each $P_b$ splits its original share $x_b$ into $u_b \| v_b$, where the former is $l-s$ bits and the latter is $s$ bits. It can be proven that:

$$TR(x,s) = u_0 + u_1 + \mathrm{Wrap}(v_0, v_1, s) \pmod{2^{l-s}}$$

The cost is Comm(FWrap + FB2A) = $(\lambda s+14s)+(\lambda+(l-s)) = \lambda(s+1)+13s+l$.
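The truncation identity can be verified in the clear (a sketch with illustrative names):

```python
def tr_sim(x0, x1, l=16, s=5):
    # TR(x, s) = u0 + u1 + Wrap(v0, v1, s) over Z_{2^{l-s}}
    u0, v0 = x0 >> s, x0 & (2**s - 1)
    u1, v1 = x1 >> s, x1 & (2**s - 1)
    wrap = int(v0 + v1 >= 2**s)        # carry out of the low parts
    return (u0 + u1 + wrap) % 2**(l - s)
```

For any pair of shares this reproduces `((x0 + x1) % 2**l) >> s`.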

FCrossTerm

FCrossTerm is similar to secure multiplication using Beaver triples, but in the latter, both P0 and P1 hold a share of each of $x$ and $y$. The applicable condition for FCrossTerm is that P0 exclusively holds $x$ and P1 exclusively holds $y$, and each finally obtains a share of $x \cdot y$ of length $l = m+n$.

The total communication cost is $\sum_{i=0}^{m-1}(\lambda+(l-i)) = m\lambda + ml - \frac{m(m-1)}{2} = O(m\lambda + mn)$.

FUMult

The semantics are that the parties hold $x = x_0 + x_1 \pmod{2^m}$ and $y = y_0 + y_1 \pmod{2^n}$, and the goal is to compute $z = x \cdot y \in \mathbb{Z}_{2^{m+n}}$.

Here, we cannot directly use Beaver triples: the shares of $x$ and $y$ live in $\mathbb{Z}_{2^m}$ and $\mathbb{Z}_{2^n}$ while the product lives in $\mathbb{Z}_{2^{m+n}}$, so unknown wrap-arounds appear. One approach is to extend both $x$ and $y$ to a sufficiently large ring, perform secure multiplication, and then truncate the result. However, this approach has two problems: first, FTR truncates the low-order bits rather than the high-order ones, requiring modifications to the protocol; second, this method is computationally expensive, making it less efficient than the dedicated FUMult protocol described below.

We now need to compute (x0+x1)(y0+y1)=x0y0+x1y1+x0y1+x1y0. The first two terms can be computed offline by P0 and P1 respectively. The latter two terms involve the FCrossTerm mentioned earlier. In short, we have the complete x0y0 stored at P0, x1y1 stored at P1, and a portion of x0y1 and x1y0 held by both parties. The function of FUMult is to securely combine these four terms and handle the wrap-around issue.

Let's try adding all the product terms of the shares:

$$x_0y_0 + x_1y_1 + x_0y_1 + x_1y_0 = (x_0+x_1)(y_0+y_1)$$

And for the correction shares $g_b$ of $w_y\,x$ and $h_b$ of $w_x\,y$:

$$\sum_b(2^n g_b + 2^m h_b) = 2^n g + 2^m h = 2^n(w_y x) + 2^m(w_x y)$$

Combining them, $z = (x_0+x_1)(y_0+y_1) - 2^n(w_y x) - 2^m(w_x y)$

$= (x+w_x2^m)(y+w_y2^n) - 2^n(w_y x) - 2^m(w_x y) = xy \pmod{2^{m+n}}$, which cancels out perfectly (the $w_xw_y2^{m+n}$ term vanishes modulo $2^{m+n}$).
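The wrap-correction identity can be verified numerically in the clear (a sketch; the share values and ring sizes are arbitrary):

```python
def umult_identity(x0, x1, y0, y1, m=8, n=12):
    # Check z = (x0+x1)(y0+y1) - 2^n*(wy*x) - 2^m*(wx*y) = x*y (mod 2^{m+n})
    M, N, L = 2**m, 2**n, 2**(m + n)
    x, y = (x0 + x1) % M, (y0 + y1) % N
    wx, wy = int(x0 + x1 >= M), int(y0 + y1 >= N)   # wraps of the shares
    z = ((x0 + x1) * (y0 + y1) - N * wy * x - M * wx * y) % L
    return z == (x * y) % L
```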

Let $\nu = \max(m,n)$ and $l = m+n$. The paper provides the exact cost, which is of order $O(\lambda\nu+\nu^2)$. Compared to the Beaver-triple method (including the generation of multiplication triples), which has complexity $O(\lambda l+l^2)$, the order is the same, but the paper claims that FUMult requires about 1.5× less communication.

The principle of FSMult is similar to FUMult, and the communication volume is exactly the same, so we omit the details here.

FDigDec

The function is to split an $l$-bit share $x$ into $c = l/d$ blocks of $d$ bits each, resulting in $\{z_i\}_{i=0}^{c-1}$, where $x = z_{c-1}\|z_{c-2}\|\cdots\|z_0$. Clearly, performing this operation locally would lose the carries between blocks, i.e. a wrap-around problem.

Of course, the solution is straightforward. First, use FWrap to determine whether the two shares of $x[0,d-1]$ overflow, obtaining a share of the carry $c_0$. Let $z_0 = x[0,d-1] - 2^d c_0$ (where $x[0,d-1]$ denotes the sum of the low-bit shares), and add $c_0$ to the higher-order bits $x[d,l-1]$ of both parties. Finally, recurse.

The complexity is equivalent to $(c-1)$ calls to FWrap, i.e. $(c-1)(\lambda d+14d)$.
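The recursive carry propagation can be sketched in the clear (illustrative names; the real protocol secret-shares each carry):

```python
def digdec_sim(x0, x1, l=16, d=4):
    # Split shares of x into l/d digits of d bits each, folding the FWrap
    # carry of each low block into the next higher block.
    zs = []
    for _ in range(l // d):
        v0, v1 = x0 & (2**d - 1), x1 & (2**d - 1)
        carry = int(v0 + v1 >= 2**d)           # FWrap on the current block
        zs.append((v0 + v1) % 2**d)
        x0, x1 = (x0 >> d) + carry, x1 >> d    # add carry to the higher bits
    return zs                                  # digits, least significant first
```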

FMSNZBP

The detection of the global index of the highest non-zero bit in the $i$-th $d$-bit block can also be expressed as $\lfloor\log_2(z_i)\rfloor + i \cdot d$. The implementation solves this directly with FLUT. However, there is a problem: when $z_i = 0$, the expression on the right is undefined and no longer matches the semantics on the left. The authors choose to leave the result undefined when $z_i = 0$, leaving it to the higher-level protocol to handle.

I don't quite understand this point. The case where zi=0 can also be solved directly using FLUT without increasing communication costs.

In short, the communication is equivalent to one 1-out-of-$2^d$ $OT_{\log_2 l}$, so the cost is $2\lambda + 2^d\log_2 l$.

FZeros,FOneHot

The former determines whether a $d$-bit value is zero. The latter transforms a share of $k \in [0, l-1]$ into a vector share of length $l$ whose sum is $(0,0,\ldots,1,\ldots,0)$, with the 1 in position $k$.

These two can be implemented using 1-out-of-$2^d$ OT and 1-out-of-$l$ $OT_l$ respectively, with costs $2\lambda+2^d$ and $2\lambda+l^2$. Although the latter is expensive, it is only called once at the end of the parent protocol FMSNZB, so the total cost is still acceptable.

FMSNZB

The meaning is that given an input $x$ of length $l$, we compute the index $k$ of its highest non-zero bit, and output a vector share whose sum is $(0,0,\ldots,1,\ldots,0)$, with the 1 in position $k$. For simplicity, I assume here that FMSNZBP is well-defined (i.e., it handles the case $z_i = 0$). Let $\iota = \log_2 l$.

Summary

| Primitive | Dependent primitives | Function | Communication overhead |
|---|---|---|---|
| FMUX | 1-out-of-2 $OT_l$ | Ternary operator on $l$-bit values | $2\lambda+2l$ |
| FOR | FAND, i.e. Beaver triple | Logical OR | $\lambda+20$ |
| FEQ | 1-out-of-$2^m$ OT, FAND | $l$-bit arithmetic equality | $<\frac{3}{4}\lambda l+9l$ |
| FLT/GT | 1-out-of-$2^m$ OT, FAND | $l$-bit arithmetic comparison | $<\lambda l+14l$ |
| FLUT | 1-out-of-$2^m$ $OT_n$ | Lookup table with $m$-bit input, $n$-bit output | $2\lambda+2^mn$ |
| FZExt | FWrap, FB2A | Zero extension from $m$ to $n$ bits | $\lambda(m+1)+13m+n$ |
| FTR | FWrap, FB2A | Truncation of $l$ bits to the high $l-s$ bits | $\lambda(s+1)+l+13s$ |
| FUMult, FSMult | FCrossTerm, FWrap, FMUX | Unsigned/signed multiplication of $m$- and $n$-bit inputs | $O(\lambda l+l^2)$ |
| FMSNZB | FDigDec, FMSNZBP, FOneHot, etc. | Index of the highest non-zero bit of an $l$-bit value | $\lambda(5l-4)+l^2$ |

Primitives

We have now built all the basic protocols that the SecFloat paper relies on, and we move on to the important primitives constructed in this paper.

FFPcheck_{p,q}

This is used to check whether floating-point numbers overflow or underflow given floating-point parameters p and q (the IEEE 754 standard uses p=8,q=23). This is an innovative aspect of this paper: the ability to choose the floating-point parameters independently, without necessarily adhering to the IEEE 754 format.

α = (z, s, e, m)
if 1{e > 2^{p-1} - 1} then
   m = 2^q; e = 2^{p-1}
if 1{z = 1} ∨ 1{e < 2 - 2^{p-1}} then
   m = 0; e = 1 - 2^{p-1}; z = 1
Return (z, s, e, m)

This logic aligns with the paper's approach to handling overflow and underflow.

FTRS_{l,s}

FTRS builds upon FTR by adding the floating-point sticky bit $S$: the LSB of the truncated result is set to 1 if any of the discarded lower $s$ bits is 1 (when the LSB is already 1, this changes nothing). This is equivalent to (x >> s) | (x & (2**s - 1) != 0). The right shift is handled by FTR (a combination of FWrap and FB2A), but an additional FZeros operation on the discarded bits is required afterward, which is costly.

Since FZeros is also equivalent to checking whether the sum of the two shares is $2^s$ or whether both are 0, it is essentially a comparison (or an FWrap). Therefore, the authors chose to combine FWrap and FZeros into FWrap&All0s, with a cost equivalent to only one comparison plus one FAND (there is a typo in the paper; it is not the cost of two FANDs).
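The plaintext semantics of FTRS are one line of Python (a sketch of the target function only, not the protocol):

```python
def trs(x, s):
    # Truncate-and-reduce with sticky bit: OR any discarded 1 into the LSB.
    return (x >> s) | int(x & (2**s - 1) != 0)
```

For example, discarding `01` from `0b1001` yields `0b11`, while discarding `00` from `0b1000` yields `0b10`.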

FRNTE_l

FRNTE is equivalent to a right shift with rounding, $x \gg_R r$. As discussed previously, the IEEE rounding logic is as follows:

$$c = g \wedge (d \vee f)$$

We first use TRS(x, r-2) to collapse all bits below $f$'s position into the sticky bit, so that the last three bits of the intermediate result are $d, g, f$. Then, we use FLUT to hardcode this 3-bit rounding logic (a table of $2^3 = 8$ entries) into the carry $c$. Finally, we call TR(x, 2) and add the rounding result $c$.

FRound_{p,q,Q}

Given a floating-point number $\alpha = (z,s,e,m)$, round the normalized mantissa $m$ from $Q$-bit precision to $q$-bit precision, and handle any carry overflow resulting from the rounding. Note that $m$ is represented in fixed-point form: the mantissa $1.M$ corresponds to the fixed-point number $m = (1.M)\times 2^Q$, so the range of $m$ is $[2^Q, 2^{Q+1})$. The protocol logic is as follows:

if 1{m >= 2^{Q+1} − 2^{Q−q−1}} then
   Return (e+1, 2^q)
else
   Return (e, m >>R (Q−q))

The logic splits into two cases: if $m \ge 2^{Q+1}-2^{Q-q-1}$, rounding would carry past $2^{Q+1}$, so the exponent is incremented and the mantissa saturates to $2^q$ (i.e., 1.0); otherwise an ordinary round-to-nearest-even right shift by $Q-q$ bits suffices.
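The branch logic can be checked on small integers; a Python sketch with tiny $Q, q$ for readability (round-to-nearest-even inlined; names are mine):

```python
def fp_round(e, m, Q, q):
    # Round a normalized fixed-point mantissa m in [2^Q, 2^{Q+1}) from Q to q
    # bits of precision; the first branch absorbs a carry past 2^{Q+1}.
    if m >= 2**(Q + 1) - 2**(Q - q - 1):
        return e + 1, 2**q                      # mantissa saturates to 1.0
    r = Q - q                                   # round-to-nearest-even shift
    d = (m >> r) & 1
    g = (m >> (r - 1)) & 1
    f = int(m & (2**(r - 1) - 1) != 0)
    return e, (m >> r) + (g & (d | f))
```

With $Q=5, q=2$: mantissa 39 (4.875 units) rounds up to 5, 36 (4.5) ties to the even value 4, and 60 (7.5) would carry past the mantissa range, so the exponent branch fires.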

FFPAdd_{p,q}

In the Background section, we summarized the method for ordinary floating-point addition. Below, we show how that method corresponds to the pseudocode:

  1. Exponent alignment. If $E_x > E_y$, left-shift the mantissa of $x$: $M_x \leftarrow (1.M_x)\cdot 2^{E_x-E_y}$, $E \leftarrow E_y$.
  2. Mantissa addition. Consider the signs $S_x$ and $S_y$. If they are the same, calculate $M = M_x + M_y$. If they differ, subtract the smaller from the larger to get the absolute difference $M = |M_x - M_y|$, and attach the correct sign $S$.

Here we consider the same-sign and different-sign cases. When the signs are the same, the XOR of the sign bits of $\beta_1$ and $\beta_2$ is 0; otherwise, it is 1. Therefore, when $\beta_1.s \oplus \beta_2.s = 1$, we flip the sign of $m_2$ to perform the subtraction.

  3. Normalization. Here we find the most significant non-zero bit of the sum $m$. To ensure sufficient space for rounding, the protocol zero-extends $m_1$ and $m_2$, which have precision $q$, to $2q+2$ bits in the first step (but note that the precision becomes $2q+1$). Assuming the result of MSNZB is $k$, aligning the most significant bit with position $2q+1$ (as used in the subsequent Round* protocol) requires a left shift of $2q+1-k$ bits, corresponding to multiplication by $K = 2^{2q+1-k}$.

Regarding the exponent: since $m$ was shifted left by $2q+1-k$ bits, and the precision change added $q+1$, the new exponent becomes $e' = e-(2q+1-k)+(q+1) = e+k-q$.

  4. Writing the result. Finally, a round of rounding is performed using FRound_{p,q,2q+1}. Since $|m_1| > |m_2|$, the sign bit is determined by $\beta_1.s$. Last, check whether the normalized result is within the valid range of floating-point numbers.

FFPMul_{p,q}

  1. XOR the sign bits: $S = S_x \oplus S_y$. (Line 7)
  2. Add the exponents: $E = E_x + E_y - \mathrm{bias}$. (Line 1; the paper sets $\mathrm{bias}=0$ here.)
  3. Multiply the mantissas: $M = M_x \cdot M_y$; the result lies in $[2^{2q}, 2^{2q+2})$. (Line 2)
  4. Normalize. If the result is $\ge 2^{2q+1}$, increment $E$ and right-shift $M$ by one bit, finally removing the most significant bit of $M$.

If the result is $\ge 2^{2q+1}$, the else branch on line 6 is taken; otherwise the if branch on line 4.

  5. Rounding is involved in the step $M = M_x \cdot M_y$ and in the one-bit right shift of $M$ (i.e., FRNTE on lines 4 and 6).

Math Functions

In the book これでなっとく!数学入門――無限・関数・微分・積分・行列を理解するコツ by 瀬山士郎, the following excerpt explains how trigonometric functions are calculated on computers/calculators (I don't have the English version of this book, so I summarized it using ChatGPT):

The pages explain how trigonometric functions, especially the sine function, are calculated. Through a short classroom dialogue, the text raises the question of how a calculator can produce a value such as sin(0.7). It then explains that trigonometric functions cannot be represented by finite polynomials, but they can be expressed as infinite power series, known as Taylor (or Maclaurin) expansions. Using sinx as an example, the book shows its series expansion and explains that by taking only the first few terms, one can obtain an approximate value of sin(0.7). The more terms used, the more accurate the result becomes. Although the series is infinite and never truly ends, calculators compute enough terms to achieve practical accuracy, which is sufficient for most real-world applications.

FFPsinπ_{8,23}

Based on this idea, we can actually elaborate on how a computer calculates sinx.

  1. First, use trigonometric identities to reduce the range of x to [0,π/2]. (Range reduction)

This is the approach used in the paper, but in practice, reducing to $[-\pi/4, \pi/4]$ and then using the double-angle formula reduces the error by an order of magnitude.

  2. Use a Taylor-series approximation, for example, $\sin x \approx x-\frac{1}{6}x^3+\frac{1}{120}x^5$. (Polynomial evaluation)
  3. Use Horner's method: $x-\frac{1}{6}x^3+\frac{1}{120}x^5 = ((\frac{1}{120}x^2-\frac{1}{6})x^2+1)x$, to reduce the number of multiplications and lower the error.

These operations only require floating-point addition and multiplication. The coefficients of the polynomial are relatively fixed and can be obtained through a lookup table.
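The Taylor-plus-Horner scheme above can be sketched directly:

```python
import math

def sin_taylor5(x):
    # Degree-5 Taylor approximation of sin(x) in Horner form:
    # x - x^3/6 + x^5/120 = ((x^2/120 - 1/6) * x^2 + 1) * x
    x2 = x * x
    return ((x2 / 120 - 1 / 6) * x2 + 1) * x
```

On the reduced range the truncation error is on the order of $x^7/5040$, so `sin_taylor5(0.7)` already agrees with `math.sin(0.7)` to about four decimal places.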

We now map this onto the concrete primitive FFPsinπ_{8,23}, which computes $\sin \pi x$. The steps are as follows:

  1. First, handle special cases. When $|x| > 2^{23}$: since $q = 23$, $x$ exceeds the precision of the mantissa, so $x$ must be an integer and therefore $\sin \pi x = 0$. The other special case is $|x| < 2^{-14}$, in which case $\sin \pi x \approx \pi x$, with error within that of the floating-point representation of $x$ itself.
  2. Range reduction step: the goal is to compute the parity bit $a \in \{0,1\}$ and the reduced argument $\delta$ from the input $\alpha$. Write $|\alpha| = 2K+a+n$ with $n \in [0,1)$. If $n < 0.5$, then $\delta = n$; otherwise, $\delta = 1-n$. Therefore, $\sin \pi\alpha = (-1)^{a \oplus 1\{\alpha<0\}}\sin \pi\delta$.
m = α.m * 2^{α.e + 14}
a = TR(m, q+14); n = m mod 2^{q+14}

This transforms $\alpha$ into a fixed-point value. Since $|\alpha| = \alpha.m \cdot 2^{\alpha.e-q}$, we have $m = \alpha.m \cdot 2^{\alpha.e+14} = |\alpha| \cdot 2^{q+14}$. Then $a$ is the integer part of $|\alpha|$, and $n$ is the fractional part of $|\alpha|$.

f = (n > 2^{q+13} ? 2^{q+14} - n : n)
k,K = MSNZB(f); f = f * K
z = 1{f=0}; e = (z ? -2^{p-1}+1 : k - q - 14)

The first line ensures that $f = \delta \cdot 2^{q+14}$. The second line finds the most significant bit of $f$, thus normalizing $f$, and the third line sets the correct exponent $e$.

When $f = 0$, $\sin \pi\alpha = 0$. The zero bit $z$ is set to 1, and the exponent is set to $-2^{p-1}+1$ according to the convention in the paper.

δ = (z, 0, e, TR(f, q+14−Q))

Finally, the fixed-point number f is stored again with Q bits of precision and combined with other parameters to form the floating-point number δ.

  3. Polynomial evaluation step. Note that the paper uses the Remez method and other more accurate approximation methods, rather than Taylor expansion:
if 1{δ.e < −14} then
   µ = Float_{p,Q}(π) ⊗_{p,Q} δ

Previously, in the special case $|\delta| \le 2^{-14}$, we had $\sin \pi\delta \approx \pi\delta$.

if 1{δ.e < −5} then
   idx1 = δ.e + 14 mod 2^4
   (θ1, θ3, θ5) = GetC_{4,9,5,p,Q}(idx1, Θ1_sin, K1_sin)

Consider the case $2^{-14} < |\delta| < 2^{-5}$. Here, the GetC function performs a table lookup: by selecting the constant class K1_sin, it retrieves the corresponding coefficients (θ1, θ3, θ5) from the table Θ1_sin using the index idx1 (the lower 4 bits of δ.e + 14).

In GetC_{4,9,5,p,Q}, 4 is the bit-width of the index, 9 is the number of splines (coefficient sets), and 5 is the degree of the fitted polynomial.

idx2 = 32 · (δ.e + 5 mod 2^7)
idx2 = idx2 + ZXt(TR(δ.m, Q−5) mod 32, 7)
idx2 = 1{δ.e = −1} ? 127 : idx2
(θ1, θ3, θ5) = GetC_{7,34,5,p,Q}(idx2, Θ2_sin, K2_sin)

In the interval $2^{-5} \le |\delta| < 0.5$, the lookup index combines the lower 2 bits of δ.e + 5 and the upper 5 bits of δ.m (this is the second innovation of the paper: segmented indexing), selecting the (θ1, θ3, θ5) that minimize the error. The case $|\delta| = 0.5$ is handled separately.

  4. Use Horner's method to compute the corresponding floating-point value:
Δ = δ ⊗_{p,Q} δ
µ = ((θ5 ⊗ Δ) ⊞* θ3) ⊗ Δ
µ = (µ ⊞* θ1) ⊗ δ
Return (µ.z, a ⊕ α.s, Round*(µ.e,µ.m))

$$\Delta = \delta^2$$

$$\mu = (\theta_5 \otimes \Delta + \theta_3) \otimes \Delta = \theta_5\delta^4 + \theta_3\delta^2$$

$$\mu = (\mu + \theta_1) \otimes \delta = \theta_5\delta^5 + \theta_3\delta^3 + \theta_1\delta \approx \sin \pi\delta$$

As mentioned earlier, $\sin \pi\alpha = (-1)^{a \oplus 1\{\alpha<0\}}\sin \pi\delta$, so a ⊕ α.s determines the sign bit, and the final result is obtained by rounding.

To summarize, the FFPsinπ_{8,23} function first performs range reduction on the input, reducing the problem to a small interval from 0 to 0.5 and recording the sign compensation; a fast path is used for extremely large or small inputs. Then, the function determines the segment based on the exponent and mantissa bits, selects the corresponding polynomial coefficients using a lookup table, calculates the approximate value using Horner's method, and finally restores the sign and rounds to a single-precision output.

One potential drawback of this method is that the table used by the GetC function is ad-hoc and differs from the protocol used for floating-point arithmetic operations. If one wants to arbitrarily specify p,q, or migrate to SecDouble, this coefficient table needs to be recalculated to ensure the error is less than 1 ULP. The paper sacrifices algorithmic flexibility in pursuit of performance and accuracy.

FFPlog2_{8,23}

Let $\alpha = m \cdot 2^N$, where $m \in [1,2)$ and $N = \alpha.e$. Let $a = 1\{N = -1\}$. Then there are two cases:

$$\log_2(\alpha) = \begin{cases} N + \log_2(1+\delta), & a = 0,\\ \log_2(1-\delta), & a = 1.\end{cases}$$

Then, two sets of approximating polynomials $(\theta_0^a,\theta_1^a,\theta_2^a,\theta_3^a)$ and $(\theta_0^b,\theta_1^b,\theta_2^b,\theta_3^b)$ are constructed for $a=1$ and $a=0$ respectively. The values of δ.e and δ.m within the specified range are then looked up in a table using the segmented-indexing method.

Next, the value of $\log_2(1\pm\delta)$ is calculated using Horner's method. For the final addition of $N$, $N$ is first converted to a floating-point number using a lookup table, and then the FFPAdd_{p,q} protocol is called.
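The case split can be sanity-checked in plaintext Python (an illustrative sketch of the math only, not of the protocol):

```python
import math

def log2_via_cases(alpha):
    # Split alpha = m * 2^N with m in [1,2); when N == -1, write alpha = 1 - delta.
    N = math.floor(math.log2(alpha))
    m = alpha / 2**N
    if N == -1:                       # a = 1
        delta = 1 - m / 2             # alpha = 1 - delta, 0 < delta <= 1/2
        return math.log2(1 - delta)
    delta = m - 1                     # a = 0
    return N + math.log2(1 + delta)
```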

  1. Range Reduction Step:
a = 1{N = −1}
f = a ? (2^{q+1} − α.m) : (α.m − 2^q)
k,K = MSNZB(f); f = f *_{q+1} K
e = a ? (k − q − 1) : (k − q)

When $N = -1$ (in which case $a = 1$), $f = 1-m/2$; otherwise (in which case $a = 0$), $f = m-1$ (both in fixed-point form). Then $f$ is normalized.

z = 1{f = 0}; e = (z ? −2^{p−1}+1 : e); 
N = α.e; δ = (z,0,e, f *_{Q+1} 2^{Q−q})

A floating-point number δ is constructed using f, and the precision is increased from q=23 bits to Q=27 bits, in preparation for the subsequent evaluation.

  2. Polynomial evaluation step:
if 1{δ.z} then
   µ = Float_{p,Q}(0)

When $\delta = 0$, $\log_2(1\pm\delta) = \log_2 1 = 0$, so $\mu = 0$ is returned.

if 1{δ.e < −5} then
   idx1 = (δ.e + 24) mod 2^5
   (θa0,θa1,θa2,θa3) = GetC_{5,19,4,p,Q}(idx1, Θ1_log, K1_log)
   (θb0,θb1,θb2,θb3) = GetC_{5,18,4,p,Q}(idx1, Θ3_log, K3_log)

When $2^{-24} \le \delta < 2^{-5}$, the lower 5 bits of δ.e + 24 are used as the index to look up the table Θ1_log for the case $a=1$, and the table Θ3_log for the case $a=0$. Both lookups must be performed, since the access pattern must not depend on the secret value of $a$.

else
   idx2 = 16 · (δ.e + 5 mod 2^7)
   idx2 = idx2 + ZXt(TR(δ.m, Q−4) mod 16, 7)
   (θa0,θa1,θa2,θa3) = GetC_{7,20,4,p,Q}(idx2, Θ2_log, K2_log)
   (θb0,θb1,θb2,θb3) = GetC_{7,32,4,p,Q}(idx2, Θ4_log, K4_log)

When $2^{-5} \le \delta < 1$, the lower 3 bits of δ.e + 5 and the upper 4 bits of δ.m form the index to look up the table Θ2_log for $a=1$, and Θ4_log for $a=0$.

(θ0,θ1,θ2,θ3) = a ? (θa0,θa1,θa2,θa3) : (θb0,θb1,θb2,θb3)

Based on the classification of the value of a discussed above, select the final polynomial coefficients.

  3. Horner's method and the final addition of N.
µ = ((θ3 ⊗ δ) ⊞* θ2) ⊗ δ
µ = ((µ ⊞* θ1) ⊗ δ) ⊞* θ0

Calculate the value of the cubic polynomial $\mu = \theta_3\delta^3+\theta_2\delta^2+\theta_1\delta+\theta_0 \approx \log_2(1\pm\delta)$.

β = LInt2Float(N)
β′ = (β.z,β.s,β.e, β.m *_{Q+1} 2^{Q-6})

Using a small LUT (lookup table), $N$ is converted into a low-precision (6-bit mantissa) floating-point number $\beta$, which is then brought to the high precision $Q$, denoted $\beta'$.

γ = a ? µ : (µ ⊞* β′)
Return (γ.z, γ.s, Round*(γ.e, γ.m))

Finally, when $a = 0$, the floating-point addition $\mu \boxplus^* \beta'$ adds $N$ back in; when $a = 1$, $\mu$ already equals $\log_2(\alpha)$, since $N = -1$ is absorbed into $\log_2(1-\delta)$.

The final result of the protocol is obtained by rounding γ.