CONNOTECH Experts-conseils Inc.

Software Acceleration for Public Key Cryptography

by Thierry Moreau

May 1997

Public key cryptography implies the use of large integer arithmetic operations, with a high proportion of modulus operations (as a reminder, the modulus operation is the remainder of an integer division). The division is by far the most compute intensive of the four elementary operations to the point where the the performance of a public key cryptosystem is always deceptively low.

To make these computations relatively efficient, the Montgomery modulo reduction algorithm is very useful. The mathematically inclined reader may refer to the original article by Peter L. Montgomery, Modular Multiplication Without Trial Division (in Mathemetics of computations, Vol. 44, no. 170, April 1985, pp 519-522). Two implementations of the Montgomery algorithm are reported, respectively

in the article by Stephen R. Dussé and Burton S. Kaliski Jr., A Cryptographic Library for the Motorola DSP56000 (in Advances in Cryptology, Eurocrypt'90, Lecture Notes In Computer Science no. 473, pp 230-244, Springer-Verlag, 1990), targeted at digital signal processors with peculiar instruction sets, and
in the article by S. E. Eldridge and Colin D. Walter, Hardware Implementation of Montgomery's Modular Multiplication Algorithm (in IEEE Transactions on Computers, Vol. 46, no. 6, June 1993, pp 693-699), targeted at dedicated integrated circuit design.

Simplification and adaptation of these references may be useful to the Montgomery method in the context of a general purpose computer (for software implementation). This is the subject of the present document. The intended readership includes the computer programmers and analysts.

The Montgomery method is useful when repeated modulus operations are needed with a constant divisor. When a public key cryptosystem requires the computation of a "modular inverse", the Montgomery method is less useful because the divisor is not a constant in this computation. Although the Montgomery method is useful for many public key cryptosystems, a particular focus of this document is the PEKE cryptosystem. For one of the two parties taking part of the PEKE cryptosystem protocol, the computation load is relatively small for a public key cryptosystem. The use of a relevant optimization technique may make the difference between unacceptable and bearable performance characteristics for a PEKE cryptosystem implementation.

The computation facilitated with the Montgomery method is repetitive "modular reduction" operations "mod N", where N is odd integer constant. This is because there are precomputations (on N) of which processing cost cannot be neglected. Let b be the natural computer word size of the target computer and n be such that bⁿ>N. The values to be pre-computed are b²ⁿ mod N, and also N₀' to be specified hereafter.

Montgomery multiplication, in the multiple precision case as with public key cryptography, allows fast modular arithmetic for a modulus N relatively prime to bⁿ, where bⁿ>N, and arithmetic modulo b is easy (e.g. 2⁷⁶⁷<N<2⁷⁶⁸, N is odd, b=2⁸, n=96, so bⁿ=2⁷⁶⁸). Let B=bⁿ (not to be confused with B_i, a notation proper to the PEKE specification). Given T satisfying 0<=T<=B×N, the Montgomery reduction algorithm efficiently computes T×(B^-1) mod N. Our focus with PEKE is the modular squaring, but we will show modular multiplication for sake of generality, so let T=x'×y' where x'=x×B mod N and y'=y×B mod N; then the Montgomery reduction algorithm efficiently computes x'×y'×(B^-1) mod N=x×y×B mod N.

Let us use the notation M_N,B(X,Y) for the result of X×Y×(B^-1) mod N, using the Montgomery reduction algorithm. Then x×y×B mod N=M_N,B(x×B mod N,y×B mod N). To convert an integer x, 0<=x<N, into x×B mod N, compute M_N,B(x,b²ⁿ mod N). The value b²ⁿ mod N should be pre-computed once and for all; this must be done with a general purpose division operation. See for instance the article by Per Brinch Hansen, Multiple-length Division Revisited: a Tour of the Minefield (in Software Practice and Experience, vol. 24(6), June 1994, pp 579-601). To recover x from x×B mod N, compute M_N,B(x×B mod N,1). There are two routes to complete a single modular multiplication: M_N,B(M_N,B(x,y),b²ⁿ mod N)=x×y mod N, or by pre-computing x'=M_N,B(x,b²ⁿ mod N) and y'=M_N,B(y,b²ⁿ mod N), and then M_N,B(M_N,B(x',y'),1)=x×y mod N. Whenever a series of multiplications is performed on a same set of inputs or intermediate results, the latter route is more efficient.

A series of multiplications occurs in the PEKE cryptosystem whenever the PEKE parameter t is greater than 1. Moreover, the PEKE cryptosystem may be changed without altering its security to fully exploit the Montgomery method. If this is done, the PEKE equations are restated as
x'=M_N,B(x,b²ⁿ mod N),
x₀'=M_N,B(x',x'), instead of the original x₀=x² mod N,
x_i+1'=M_N,B(x_i',x_i'), instead of the original x_i+1=x_i² mod N, and
B_i=x_i' mod 2^k, instead of the original B_i=x_i mod 2^k.
Formally, this requires an additional public parameter for the PEKE cryptosystem, that is the base bⁿ. The conversion from the value x_t'=x_t×bⁿ mod N to the value x_t is done with the equation
x_t=M_N,B(x_t',1);
it may be assigned to the initiator of the PEKE protocol to further reduce the computation load for the responder.

Moreover, the original PEKE specification requires another integer division, found in the following equation:
x=(|_x_B¦/S_|×S×C)+(x_A->B×S)+(x_B¦ mod S),
where |_a/b_| represents the integer division of a by b. This equation is equivalent to
x=((x_B¦-(x_B¦ mod S))×C)+(x_A->B×S)+(x_B¦ mod S).
Then, x_B¦ mod S can be computed with the Montgomery method,
x_B¦ mod S=M_S,B(M_S,B(x_B¦,1),b^2s mod S).
The conditions on the base B=b^2s are b^2s>S and b^2s>x_B¦.

In summary, the PEKE cryptosystem may be made more efficient if the above change is adopted and if the parameters b, n, b²ⁿ mod N, s, and b^2s mod S are all pre-computed. The efficiency gain from the precomputation of N₀' and S₀' is less important.

Now, the internals of the multiprecision Montgomery multiplication algorithm may be stated. Again, let B=bⁿ. There is a need for the value of integer N' such that B×(B^-1)-N×N'=1, where B×(B^-1) mod N=1. Actually, only the least significant part of N' is needed, hence the definition N₀'=N' mod b. We reproduce below the simple algorithm from the said article by Dussé and Kaliski to efficiently find N₀' from N₀ and b, when b is an exact power of two. Obviously, N₀' can be computed once and for all.

In the multiprecision Montgomery multiplication algorithm that follows, capital letters are multi-precision variables. The indices are as expected for natural integers, e.g. N₀ is the least significant part of N. The algorithm is an application of the convolution-sum method for the multiplication. The variable c is an accumulator with sufficient capacity for a sum of products and multiple carry bits from the additions (up to 2n of them).

The variable R_n is actually a local storage area of this algorithm (like Q and lower case variables). Consequently, the storage requirement for R is the same as for X and Y. Moreover, if X and Y are the same variable, as in the modular squaring operation of the PEKE cryptosystem, the storage for X, Y, and R can be the same.

CONNOTECH Experts-conseils Inc.
9130 Place de Montgolfier
Montréal, Québec, Canada, H2M 2A1
Tél.: +1-514-385-5691 Fax: +1-514-385-5900