Public key cryptography implies the use of large integer arithmetic operations, with a high proportion of modulus operations (as a reminder, the modulus operation is the remainder of an integer division). The division is by far the most compute intensive of the four elementary operations to the point where the the performance of a public key cryptosystem is always deceptively low.
To make these computations relatively efficient, the Montgomery modulo reduction algorithm is very useful. The mathematically inclined reader may refer to the original article by Peter L. Montgomery, Modular Multiplication Without Trial Division (in Mathemetics of computations, Vol. 44, no. 170, April 1985, pp 519-522). Two implementations of the Montgomery algorithm are reported, respectively
Simplification and adaptation of these references may be useful to the Montgomery method in the context of a general purpose computer (for software implementation). This is the subject of the present document. The intended readership includes the computer programmers and analysts.
The Montgomery method is useful when repeated modulus operations are needed with a constant divisor. When a public key cryptosystem requires the computation of a "modular inverse", the Montgomery method is less useful because the divisor is not a constant in this computation. Although the Montgomery method is useful for many public key cryptosystems, a particular focus of this document is the PEKE cryptosystem. For one of the two parties taking part of the PEKE cryptosystem protocol, the computation load is relatively small for a public key cryptosystem. The use of a relevant optimization technique may make the difference between unacceptable and bearable performance characteristics for a PEKE cryptosystem implementation.
The computation facilitated with the Montgomery method is
repetitive "modular reduction" operations
"mod N
", where
N
is odd integer constant. This is because there are precomputations
(on
N
) of which processing cost cannot be neglected. Let
b
be the natural computer word size of the target computer and
n
be such that
bn>N
. The values to be pre-computed are
b2n mod N
, and also
N0'
to be specified hereafter.
Montgomery multiplication, in the multiple precision case as with
public key cryptography, allows fast modular arithmetic for a modulus
N
relatively prime to
bn
, where
bn>N
, and arithmetic modulo
b
is easy (e.g.
2767<N<2768
,
N
is odd,
b=28
,
n=96
, so
bn=2768
). Let
B=bn
(not to be confused with
Bi
, a notation proper to the PEKE specification). Given
T
satisfying
0<=T<=B×N
, the Montgomery reduction algorithm efficiently computes
T×(B-1) mod N
. Our focus with PEKE is the modular
squaring, but we will show modular
multiplication for sake of generality, so let
T=x'×y'
where
x'=x×B mod N
and
y'=y×B mod N
; then the Montgomery reduction algorithm efficiently computes
x'×y'×(B-1) mod N=x×y×B mod N
.
Let us use the notation
MN,B(X,Y)
for the result of
X×Y×(B-1) mod N
, using the Montgomery reduction algorithm. Then
x×y×B mod N=MN,B(x×B mod N,y×B mod N)
. To convert an integer
x
,
0<=x<N
, into
x×B mod N
, compute
MN,B(x,b2n mod N)
. The value
b2n mod N
should be pre-computed once and for all; this must be done with a
general purpose division operation. See for instance the article by Per
Brinch Hansen,
Multiple-length Division Revisited: a Tour of the
Minefield (in Software Practice and Experience, vol. 24(6), June 1994, pp
579-601). To recover
x
from
x×B mod N
, compute
MN,B(x×B mod N,1)
. There are two routes to complete a single modular multiplication:
MN,B(MN,B(x,y),b2n mod N)=x×y mod N
, or by pre-computing
x'=MN,B(x,b2n mod N)
and
y'=MN,B(y,b2n mod N)
, and then
MN,B(MN,B(x',y'),1)=x×y mod N
. Whenever a series of multiplications is performed on a same set of
inputs or intermediate results, the latter route is more efficient.
A series of multiplications occurs in the PEKE cryptosystem whenever
the PEKE parameter
t
is greater than 1. Moreover, the PEKE cryptosystem may be changed
without altering its security to fully exploit the Montgomery method. If
this is done, the PEKE equations are restated as
x'=MN,B(x,b2n mod N)
,
x0'=MN,B(x',x')
, instead of the original
x0=x2 mod N
,
xi+1'=MN,B(xi',xi')
, instead of the original
xi+1=xi2 mod N
, and
Bi=xi' mod 2k
, instead of the original
Bi=xi mod 2k
.
Formally, this requires an additional public parameter for the PEKE
cryptosystem, that is the base
bn
. The conversion from the value
xt'=xt×bn mod N
to the value
xt
is done with the equation
xt=MN,B(xt',1)
;
it may be assigned to the initiator of the PEKE protocol to further
reduce the computation load for the responder.
Moreover, the original PEKE specification requires another integer
division, found in the following equation:
x=(|_xB¦/S_|×S×C)+(xA->B×S)+(xB¦ mod S)
,
where
|_a/b_|
represents the integer division of
a
by
b
. This equation is equivalent to
x=((xB¦-(xB¦ mod S))×C)+(xA->B×S)+(xB¦ mod S)
.
Then,
xB¦ mod S
can be computed with the Montgomery method,
xB¦ mod S=MS,B(MS,B(xB¦,1),b2s mod S)
.
The conditions on the base
B=b2s
are
b2s>S
and
b2s>xB¦
.
In summary, the PEKE cryptosystem may be made more efficient if the
above change is adopted and if the parameters
b
,
n
,
b2n mod N
,
s
, and
b2s mod S
are all pre-computed. The efficiency gain from the precomputation of
N0'
and
S0'
is less important.
Now, the internals of the multiprecision Montgomery multiplication
algorithm may be stated. Again, let
B=bn
. There is a need for the value of integer
N'
such that
B×(B-1)-N×N'=1
, where
B×(B-1) mod N=1
. Actually, only the least significant part of
N'
is needed, hence the definition
N0'=N' mod
b
. We reproduce below the simple algorithm from the said article by
Dussé and Kaliski to efficiently find
N0'
from
N0
and
b
, when
b
is an exact power of two. Obviously,
N0'
can be computed once and for all.
In the multiprecision Montgomery multiplication algorithm that
follows, capital letters are multi-precision variables. The indices are as
expected for natural integers, e.g.
N0
is the least significant part of
N
. The algorithm is an application of the convolution-sum method for
the multiplication. The variable
c
is an accumulator with sufficient capacity for a sum of products and
multiple carry bits from the additions (up to
2n
of them).
The variable
Rn
is actually a local storage area of this algorithm (like
Q
and lower case variables). Consequently, the storage requirement for
R
is the same as for
X
and
Y
. Moreover, if
X
and
Y
are the same variable, as in the modular squaring operation of the
PEKE cryptosystem, the storage for
X
,
Y
, and
R
can be the same.
CONNOTECH Experts-conseils Inc.
9130 Place de Montgolfier
Montréal, Québec, Canada, H2M 2A1
Tél.: +1-514-385-5691
Fax: +1-514-385-5900