

The Subterranean 2.0 cipher suite

Joan Daemen, Pedro Maat Costa Massolino and Yann Rotella

Radboud University, Digital Security Department, Nijmegen

Abstract. This paper presents the Subterranean 2.0 cipher suite that can be used for hashing, MAC computation, stream encryption and several types of session authenticated encryption schemes. At its core it has a duplex object with a 257-bit state and a lightweight single-round permutation. This makes Subterranean 2.0 very well suited for low-area and low-energy implementations in dedicated hardware.

Version of the Subterranean suite: 2.0
Version of this document: 1.1, March 29, 2019

Keywords: lightweight, permutation-based crypto, deck function, XOF function, session authenticated encryption

1 Introduction

Subterranean is a cryptographic primitive to be used both for hashing and as a stream cipher and dates back to 1992 [17, 18]. With some imagination its mode can be seen as a precursor to the sponge [7] with an absorbing phase followed by a squeezing phase.

The round function of Subterranean has features that were adopted in several designs over the last three decades, including Keccak-p [12] and Xoodoo [21]. Namely, all its steps, except the addition of a constant, are bit-level shift-invariant operations, its non-linear step is χ, the mixing step is a lightweight bit-oriented mapping with a heavy inverse and it has a bit transposition step.

But the Subterranean round function also differs from Keccak-p and Xoodoo in important ways. Namely, its state is essentially one-dimensional rather than 3-dimensional; due to the particular transposition and the 257-bit state, it is not software-friendly; and it has a buffer, similar to the belt in belt-and-mill designs such as RadioGatún [6].

Despite the differences, it only takes some minor refurbishing to turn Subterranean into a lightweight symmetric cipher suite that can compete with new designs, at least when implemented in dedicated hardware. Refurbishing we did, and we call the result Subterranean 2.0. In short, it is Subterranean with the buffer removed and the hashing and stream encryption modes replaced by a duplex-based construction with modes on top, inspired by Xoodyak [19]. The result is very efficient in hardware but not suited for software. We believe this makes sense in resource-constrained platforms, in particular, when energy per bit is the primary concern and relatively short messages must be protected. The design of Subterranean makes no compromise to be efficient in software, giving it an exceptionally good trade-off between safety margin and hardware performance.

Subterranean 2.0 operates on a state of 257 bits. The modernization into a duplex object required updating the output extraction and the input injection. For the former, we have opted to extract a 32-bit string z per duplex call, where each bit of z is the sum of 2 state bits. For the latter, we inject a string σ of up to 32 bits per duplex call in keyed mode and up to 8 bits every two rounds in unkeyed mode. The central function of the duplex object is the application of a permutation to the state, the subsequent injection of the input string and the extraction of the output. On top of this a number of wrapper functions are defined for absorbing arbitrary-length strings, possibly combined with encryption or decryption, for performing blank rounds and for squeezing arbitrary-length strings.

Loyal to the original Subterranean, we chose for the permutation f in duplex to have only one round and so we expect there to be attacks better than generic ones, i.e., those not exploiting the specifics of f. In particular we claim 128 bits of security against multi-target attackers in keyed modes and 112 bits in unkeyed modes.

In Section 2 we specify the Subterranean duplex object, the primitive underlying the schemes we propose, and the three cryptographic schemes that are specified as modes on top of it: an eXtendable Output Function (XOF), a Doubly-Extendable Cryptographic Keyed (deck) function and a Session Authenticated Encryption (SAE) scheme. This is not meant to be exhaustive but covers most use cases: the XOF for hashing, the deck function for MAC computation, stream encryption, key derivation and more sophisticated modes such as those specified in the Xoodoo cookbook [19], and the SAE scheme for compact authenticated encryption. In Section 3 we provide the design rationale of the Subterranean 2.0 cipher suite. In Section 4 we discuss how Subterranean should optimally be implemented. In Section 5 we discuss techniques for software optimizations that have played a role in the choice of bit positions for output and input. Finally, Section 6 discusses parameters to be used in the NIST lightweight competition.

2 Specification of the Subterranean 2.0 suite

We specify the Subterranean 2.0 suite in a bottom-up fashion, starting with the round function, input injection and output extraction in Section 2.1, the Subterranean 2.0 duplex object in Section 2.2, the XOF function in Section 2.3, the deck function in Section 2.4 and the SAE scheme in Section 2.5.

2.1 The round function R, input injection and output extraction

The round function R operates on a 257-bit state and has four steps:

$$R = \pi \circ \theta \circ \iota \circ \chi, \qquad (1)$$

where each step is there for a particular purpose: χ for non-linearity, ι for asymmetry, θ for mixing and π for dispersion. We denote the state as $s$ and its bits as $s_i$ with position index $i$ ranging from 0 to 256, where any expressions in the index must be taken modulo 257. For all $0 \le i < 257$:

$$
\begin{aligned}
\chi &: s_i \leftarrow s_i + (s_{i+1} + 1)\, s_{i+2}, \\
\iota &: s_i \leftarrow s_i + \delta_i, \\
\theta &: s_i \leftarrow s_i + s_{i+3} + s_{i+8}, \\
\pi &: s_i \leftarrow s_{12i}.
\end{aligned}
$$

Here the addition and multiplication of state bits are in $\mathbb{F}_2$, and $\delta_i$ is a Kronecker delta: $\delta_i = 1$ if $i = 0$ and 0 otherwise. Figure 1 illustrates the round function by the computational graph of a single bit of the state.

At the core of the Subterranean duplex object is a simple (internal) duplex call that first applies the Subterranean round function R to the state and then injects a string σ of variable length of at most 32 bits. Before adding it into the state, it pads the string σ to 33 bits with simple padding (10*) and hence the injection rate is 33 bits. In between duplex calls, one may extract 32-bit strings z from the state, so the extraction rate is 32 bits. Each of the 32 bits of the extracted output z is constructed as the sum of two state bits. These are taken from 64 fixed positions that are the elements of the multiplicative
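The four steps of the round function can be sketched directly from their definitions. The following Python fragment is only an illustration, assuming the state is held as a list of 257 integers in {0, 1}; the function names are chosen for this sketch and do not come from the specification.

```python
SIZE = 257  # state width in bits

def chi(s):
    # s_i <- s_i + (s_{i+1} + 1) * s_{i+2}, arithmetic in F_2
    return [s[i] ^ ((s[(i + 1) % SIZE] ^ 1) & s[(i + 2) % SIZE]) for i in range(SIZE)]

def iota(s):
    # s_i <- s_i + delta_i: only bit 0 is flipped
    return [s[0] ^ 1] + s[1:]

def theta(s):
    # s_i <- s_i + s_{i+3} + s_{i+8}
    return [s[i] ^ s[(i + 3) % SIZE] ^ s[(i + 8) % SIZE] for i in range(SIZE)]

def pi(s):
    # s_i <- s_{12 i}
    return [s[(12 * i) % SIZE] for i in range(SIZE)]

def round_function(s):
    # R = pi o theta o iota o chi: chi is applied first, pi last
    return pi(theta(iota(chi(s))))

# One round on the all-zero state: chi leaves it at zero, iota sets bit 0,
# theta spreads that bit to three positions and pi moves them apart.
state = round_function([0] * SIZE)
ones = [i for i, bit in enumerate(state) if bit]
print(ones)  # [0, 64, 85]
```

The printed positions follow from the definitions: after θ the set bits are at 0, 249 and 254, and π maps them to the indices i with 12i ≡ 0, 254, 249 (mod 257).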

Algorithm 1 Subterranean duplex object

Interface: Constructor: Subterranean()
  s ← 0^257

Interface: Y ← absorb(X, op) with op ∈ {unkeyed, keyed, encrypt, decrypt}
  if op = unkeyed then w = 8 else w = 32
  Let x[n] be X split in w-bit blocks, with last block strictly shorter
  Y ← ε
  for all blocks of x[n] do
    if op ∈ {encrypt, decrypt} then
      temp ← x[i] + (extract(s) truncated to |x[i]|)
      Y ← Y || temp
    if op = decrypt then duplex(temp) else duplex(x[i])
    if op = unkeyed then duplex(ε)
  return Y

Interface: blank(r) with r a natural number
  for r times do duplex(ε)

Interface: Z ← squeeze(ℓ) with ℓ a natural number
  Z ← ε
  while |Z| < ℓ do
    Z ← Z || extract(s)
    duplex(ε)
  return Z truncated to ℓ bytes

Internal interface: duplex(σ) with |σ| ≤ 32
  s ← R(s)
  x ← σ || 1 || 0^(32−|σ|)
  for j from 0 to 32 do
    s_{12^{4j}} ← s_{12^{4j}} + x_j

Internal interface: z ← extract(s)
  z ← ε
  for j from 0 to 31 do
    z ← z || (s_{12^{4j}} + s_{−12^{4j}})
  return z
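Algorithm 1 can be turned into a small bit-level model. The sketch below is an illustration under stated assumptions, not a reference implementation: inputs and outputs are Python lists of bits (so no byte or bit ordering is fixed), `squeeze` takes its length in bits, and the class and method layout simply mirrors the interfaces above.

```python
SIZE = 257
# Positions 12^(4j) mod 257 for j = 0..32 (injection); the first 32 also serve extraction
POS = [pow(12, 4 * j, SIZE) for j in range(33)]

def round_function(s):
    n = SIZE
    s = [s[i] ^ ((s[(i + 1) % n] ^ 1) & s[(i + 2) % n]) for i in range(n)]  # chi
    s[0] ^= 1                                                               # iota
    s = [s[i] ^ s[(i + 3) % n] ^ s[(i + 8) % n] for i in range(n)]          # theta
    return [s[(12 * i) % n] for i in range(n)]                              # pi

class Subterranean:
    def __init__(self):
        self.s = [0] * SIZE

    def duplex(self, sigma):
        assert len(sigma) <= 32
        self.s = round_function(self.s)
        x = list(sigma) + [1] + [0] * (32 - len(sigma))  # 10* padding to 33 bits
        for j in range(33):
            self.s[POS[j]] ^= x[j]

    def extract(self):
        # Each output bit is the sum of the bits at positions 12^(4j) and -12^(4j)
        return [self.s[POS[j]] ^ self.s[-POS[j] % SIZE] for j in range(32)]

    def absorb(self, bits, op):
        w = 8 if op == "unkeyed" else 32
        blocks = [bits[i:i + w] for i in range(0, len(bits), w)]
        if not blocks or len(blocks[-1]) == w:
            blocks.append([])  # the last block must be strictly shorter than w
        out = []
        for block in blocks:
            if op in ("encrypt", "decrypt"):
                keystream = self.extract()[:len(block)]
                temp = [b ^ k for b, k in zip(block, keystream)]
                out += temp
                self.duplex(temp if op == "decrypt" else block)
            else:
                self.duplex(block)
            if op == "unkeyed":
                self.duplex([])
        return out

    def blank(self, r):
        for _ in range(r):
            self.duplex([])

    def squeeze(self, nbits):  # assumption: length given in bits
        z = []
        while len(z) < nbits:
            z += self.extract()
            self.duplex([])
        return z[:nbits]

# Usage: two objects keyed identically stay in sync through encrypt/decrypt.
key = [1, 0] * 64                    # 128 arbitrary key bits, for illustration
msg = [1, 1, 0, 1, 0, 0, 1] * 10     # 70 message bits
enc, dec = Subterranean(), Subterranean()
enc.absorb(key, "keyed")
dec.absorb(key, "keyed")
ct = enc.absorb(msg, "encrypt")
pt = dec.absorb(ct, "decrypt")
```

Both the encrypting and the decrypting side absorb the plaintext block, which is why their states remain equal and `pt` equals `msg`.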

Basically, our claim corresponds to a security strength of 112 bits against all attacks that do not apply to a random oracle. The capacity in the claim is 24 bits short of the effective capacity 257 − 9 = 248 bits to account for possible shortcut attacks. These are attacks that are more efficient than generic ones by exploiting Subterranean-specific properties (see Section 3.4).

Algorithm 2 Subterranean-XOF

Interface: Z ← Subterranean-XOF(M[[n]], ℓ) with M[[n]] a string sequence and ℓ a natural number
  S ← Subterranean()
  for all strings M[i] in M[[n]] do
    S.absorb(M[i], unkeyed)
  S.blank(8)
  return Z ← S.squeeze(ℓ)

2.4 The Subterranean-deck function

We specify Subterranean-deck in Algorithm 3. It takes as input an arbitrary-length key K and a sequence of an arbitrary number of arbitrary-length strings M[i], denoted as M[[n]], and returns a bit string of arbitrary length. It can readily be used as a stream cipher, a MAC function and for key derivation. The Farfalle paper [5] and the Xoodoo cookbook [19] specify several authenticated encryption modes for deck functions.

We claim Subterranean-deck offers 128 bits of security against adversaries that are limited to $2^{96}$ data blocks, when it is loaded with a key of min-entropy 128 bits in a single-target setting, and loaded with independent keys with min-entropy $128 + \log_2 \mu$ in a multi-target setting with μ targets.

Subterranean-deck absorbs the key in blocks of 32 bits. This makes it an application of the recent result of Bart Mennink [25]. The bottom line is that, even for a uniform key with length k and an ideal underlying permutation, the success probability of key prediction cannot be proven to be close to $N 2^{-k}$ for N operations. On the other hand, the absence of this bound does not imply there is an attack with success probability above $N 2^{-k}$ and we do not take this into account in our claim.

Claim 2. The advantage of an adversary in distinguishing an array of μ Subterranean-deck instances loaded with μ independent keys, each with min-entropy $128 + \log_2 \mu$ bits, from an array of μ independent random oracles is upper bounded by $(N + M)\,2^{-128}$, with N the total computational complexity in calls to the Subterranean round function and M the total data complexity in 32-bit input and output blocks, with $M \le 2^{96}$.

The data limit for the adversary, $M \le 2^{96}$, is not likely to pose a problem in the foreseeable future.

Algorithm 3 Subterranean-deck

Interface: Z ← Subterranean-deck(K, M[[n]], ℓ) with M[[n]] a string sequence and ℓ a natural number
  S ← Subterranean()
  S.absorb(K, keyed)
  for all strings M[i] in M[[n]] do
    S.absorb(M[i], keyed)
  S.blank(8)
  return Z ← S.squeeze(ℓ)

2.5 The Subterranean-SAE authenticated encryption scheme

We specify Subterranean-SAE in Algorithm 4. It takes a nonce when starting the session and can then encipher and authenticate a sequence of messages each consisting of a plaintext and associated data. Compared to authenticated encryption modes based on Subterranean-deck, Subterranean-SAE has smaller state and is better suited to offer protection against differential power analysis (DPA). In particular, the security is based on the secrecy of a state that evolves during the session rather than a static key.

Across sessions, one can derive a fresh key per session using Subterranean-deck. This protects against differential power analysis and provides fine-grained forward secrecy. If one wishes to use the same key for multiple sessions and DPA is a concern, one can absorb the nonce bit per bit, as was proposed by Taha and Schaumont in [28]. By taking as nonce the shortest binary representation of a session counter, this can be quite economical.

For Subterranean-SAE, we basically make the same security claim as for Subterranean-deck with two differences. First, we do not try to distinguish it from a random oracle, but from a random function with the same interface as Subterranean-SAE. Second, we

We extract the output z from state bits that are in positions distant from each other to make it very hard to reconstruct the secret state from a series of outputs z and to prevent measurable bias in the output stream Z. Likewise, we inject input strings σ in the state in positions distant from each other to make it infeasible to control difference propagation in the state. In unkeyed absorbing we limit the strings σ in length to 8 bits and we apply two rounds in between input injections. The reason for this is to make the generation of state collisions infeasible.

Subterranean-XOF and Subterranean-deck each have a single absorbing phase followed by a squeezing phase. In between those phases there is a sequence of 8 blank rounds. In Subterranean-deck, these blank rounds are meant to prevent measurable correlations or exploitable differentials between input M[i] and output Z. In Subterranean-XOF they are meant to make the generation of collisions in n-bit outputs, for any $n \le 224$, infeasible. Similarly, they should prevent the generation of (2nd) pre-images for any n-bit output with $n \le 112$ in less than $2^n$ operations. More generally, in Subterranean-XOF, Subterranean-deck and Subterranean-SAE alike, the blank rounds should make the output Z (whether tag or keystream) depend on all bits of the input in a complex way, as is the case for a random oracle. Finally, the blank rounds should prevent attacks that make use of higher order differentials such as cube attacks, by the fact that expressions of state bits as a function of the state 8 rounds ago have high degree and are relatively dense.

The size of the state, 257, fits nicely the ambition to offer a security of 128 bits in keyed modes and 112 bits in unkeyed mode. It is rather small but not too small to fall prey to time-memory-data-precomputation trade-offs.

3.2 The round function

The round function is just taken from the original Subterranean specified in [18], with the buffer addition removed and the non-linear step complemented. In short, it is a classical lightweight bit-oriented wide trail design, with a non-linear layer, a mixing layer, a transposition layer and a (round) constant addition. The former three are shift-invariant and the addition of the constant is just to ensure the round function itself is not shift-invariant. The state is a one-dimensional array of length 257, a prime. This had to be at least 256 as the original Subterranean targeted a security strength of 128 bits. A prime was taken to avoid the existence of exploitable symmetries.

3.2.1 The non-linear layer χ

The non-linear layer is χ, well known from Keccak-p. This is the most sparse shift-invariant mapping of algebraic degree 2 that is invertible when the state has odd length. By sparse we mean that each output bit only depends on few input bits, in this case 3 bits in neighbouring positions. The low degree is an advantage when countermeasures against differential power analysis need to be implemented, such as masking or threshold schemes. In general, computing the inverse of χ requires a recursive procedure and the consequence is that it is dense and has an algebraic degree that is proportionate to the state length. In the case of Subterranean, the inverse of χ has algebraic degree 128. This complexity helps in frustrating cryptanalysts.
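The invertibility of χ for odd state length can be illustrated with a simple guess-and-check inversion. This is a sketch for illustration only and is not the recursive procedure mentioned above: it assumes the state is a list of bits, guesses the two bits $a_0$ and $a_1$ of the preimage, recovers the remaining bits from $a_i = b_i + (a_{i+1} + 1)\,a_{i+2}$ by walking the index downwards, and keeps the guess that is consistent.

```python
import random

SIZE = 257

def chi(s):
    n = len(s)
    return [s[i] ^ ((s[(i + 1) % n] ^ 1) & s[(i + 2) % n]) for i in range(n)]

def chi_inverse(b):
    """Return the unique a with chi(a) == b (the length of b must be odd)."""
    n = len(b)
    for a0 in (0, 1):
        for a1 in (0, 1):
            a = [0] * n
            a[0], a[1] = a0, a1
            # a_i depends on a_{i+1} and a_{i+2}, so fill in i = n-1 down to i = 2
            for i in range(n - 1, 1, -1):
                a[i] = b[i] ^ ((a[(i + 1) % n] ^ 1) & a[(i + 2) % n])
            if chi(a) == b:  # checks the two equations not used above
                return a
    raise ValueError("no preimage found; chi is only invertible for odd lengths")

rng = random.Random(2019)  # fixed seed, arbitrary value
s = [rng.randrange(2) for _ in range(SIZE)]
recovered = chi_inverse(chi(s))
```

Every output bit of this inverse ends up depending on a long chain of input bits, which is one way to see why the inverse is dense even though χ itself only looks at three neighbouring bits.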

3.2.2 The mixing layer θ

The mixing layer θ is similarly sparse: each output bit is the sum of 3 input bits at fixed relative offsets. The Subterranean round function was an improved version of that of the very first wide-trail design, the hash function Cellhash [20]. In Cellhash, the offsets in θ were −3, 0, 3: symmetric around 0 and at minimum distance so that each output bit of θ ∘ χ depends on 9 bits. Due to this choice of offsets, there is a 2-bit input difference (resp. output mask) that leads to a 2-bit output difference (resp. input mask). This feature can be used to build n-round trails with weight n 2 n. The choice of offsets in Subterranean avoids this problem: a 2-bit input difference leads to an output difference with at least 4 bits. The symmetry around 0 was abandoned, 3 was kept and 8 was chosen as the smallest value that gives excellent mixing properties.

For studying the algebraic properties of θ, it helps to see the state s as a binary polynomial $\sum_i s_i X^i$. The operation of θ then becomes a modular multiplication:

$$\theta(s(X)) = s(X)\,(1 + X^3 + X^8) \bmod (1 + X^{257}).$$

We say $1 + X^3 + X^8$ is the multiplication polynomial of the linear shift-invariant mapping θ. The polynomials $P(X)$ of degree smaller than 257 that are coprime to $1 + X^{257}$ form a group that we will denote by Θ. The modulus $1 + X^{257}$ is the product of $1 + X$ with 16 irreducible polynomials of degree 16 as shown in table 7. As

$$P^{2^n}(X) \bmod (1 + X^{257}) = P\!\left(X^{2^n \bmod 257}\right),$$

we have $P^{2^n}(X) \bmod (1 + X^{257}) = P(X)$ if n is the order of 2 in $((\mathbb{Z}/257\mathbb{Z})^*, \times)$. The order of 2 happens to be 16, implying that the order of any $P(X) \in \Theta$ divides $2^{16} - 1 = 3 \cdot 5 \cdot 17 \cdot 257$. For $P(X) = 1 + X^3 + X^8$, we verified with sage that the order is $2^{16} - 1$ itself. It follows that θ has the maximum order of any element in Θ. The inverse of θ can be computed as $(1 + X^3 + X^8)^{2^{16} - 2}$ and has a Hamming weight of 127. This high diffusion in the backward direction helps in frustrating cryptanalysis.
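The polynomial view of θ lends itself to a quick numerical check. The sketch below is an illustration, assuming polynomials over $\mathbb{F}_2$ are stored as Python integers with bit i holding the coefficient of $X^i$; the helper names are made up for this example. It computes the order of 2 modulo 257 and the candidate inverse $(1 + X^3 + X^8)^{2^{16}-2}$, whose Hamming weight can then be compared with the value quoted above.

```python
N = 257
MASK = (1 << N) - 1

def mulmod(a, b):
    """Multiply two polynomials over F_2 modulo 1 + X^257."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        # multiply a by X; the coefficient of X^257 wraps around to X^0
        a = ((a << 1) & MASK) | (a >> (N - 1))
    return r

def powmod(a, e):
    """Square-and-multiply exponentiation modulo 1 + X^257."""
    r = 1
    while e:
        if e & 1:
            r = mulmod(r, a)
        a = mulmod(a, a)
        e >>= 1
    return r

def multiplicative_order(g, p):
    k, x = 1, g % p
    while x != 1:
        x = (x * g) % p
        k += 1
    return k

P = 1 | (1 << 3) | (1 << 8)          # 1 + X^3 + X^8
order_of_2 = multiplicative_order(2, N)
P_inv = powmod(P, 2**16 - 2)         # multiplication polynomial of the inverse of theta
weight = bin(P_inv).count("1")       # Hamming weight of that polynomial
print(order_of_2, weight)
```

Multiplying `P` by `P_inv` with `mulmod` should give the constant polynomial 1, which confirms that the exponent $2^{16} - 2$ indeed yields the inverse.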

3.2.3 The transposition π

The transposition layer π puts bits that are 12 positions apart next to each other: $s_i \leftarrow s_{12i}$. This ensures that each state bit depends on 81 bits of the state 2 cycles ago. Dually, it moves neighbouring bits to positions 150 bits apart: $s_{150j} \leftarrow s_j$, as $150 \cdot 12 \bmod 257 = 1$. The result is that a single-bit difference in the state may affect 81 state bits 2 cycles later. The order of 12 in $((\mathbb{Z}/257\mathbb{Z})^*, \times)$ is 256, or in other words, it is a generator. The consequence is that the order of the transposition π is likewise 256.

3.2.4 The order of the linear layer πθ

For understanding the order of the linear layer πθ we use the following observation. For all $i \in \mathbb{Z}$, let $\theta^{(i)}$ be the linear transformation defined as $\theta^{(i)} = \pi^{-i} \circ \theta \circ \pi^{i}$. Then, for all $n \in \mathbb{N}$, the linear transformation $(\pi\theta)^n$ can be converted into

$$\pi^{n} \circ \theta^{(n-1)} \circ \theta^{(n-2)} \circ \cdots \circ \theta^{(1)} \circ \theta^{(0)}.$$

Indeed,

$$(\pi\theta)^n = \pi \circ \theta \circ \pi \circ \theta \circ \cdots \circ \pi \circ \theta = \pi^{n} \circ \pi^{1-n} \circ \theta \circ \pi^{n-1} \circ \pi^{2-n} \circ \theta \circ \pi^{n-2} \circ \cdots \circ \pi^{-1} \circ \theta \circ \pi^{1} \circ \theta.$$

If we take this expression for n = 256, the first term becomes the identity and we obtain a mapping that is the composition of 256 linear shift-invariant mappings, each with its own multiplication polynomial. From this follows that the composed mapping is also a linear shift-invariant mapping and that its order divides $2^{16} - 1$. Hence, the order of πθ divides $2^{8}(2^{16} - 1)$. We checked the divisors of this integer and it turns out that the order of πθ is 256 and the minimal polynomial of πθ is $1 + X^{256}$. More generally, we can prove the following lemma.

Lemma 1. Let θ′ be a linear shift-invariant mapping with multiplication polynomial P and let π′ be defined as $s_i \leftarrow s_{g \times i}$, with ord(g) the multiplicative order of g in $(\mathbb{Z}/257\mathbb{Z})^*$. If ord(g) ≥ 16, the order of π′ ∘ θ′ divides ord(g).
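The statements on the order of π and of the linear layer πθ can be checked numerically on individual state vectors. The following sketch is an illustration under the bit-list representation assumed throughout these examples; it applies πθ repeatedly to unit vectors and also confirms the two facts about 12 used for the transposition.

```python
SIZE = 257

def theta(s):
    return [s[i] ^ s[(i + 3) % SIZE] ^ s[(i + 8) % SIZE] for i in range(SIZE)]

def pi(s):
    return [s[(12 * i) % SIZE] for i in range(SIZE)]

def iterate_linear_layer(s, n):
    """Apply the linear layer (theta first, then pi) n times."""
    for _ in range(n):
        s = pi(theta(s))
    return s

def unit(i):
    e = [0] * SIZE
    e[i] = 1
    return e

def multiplicative_order(g, p):
    k, x = 1, g % p
    while x != 1:
        x = (x * g) % p
        k += 1
    return k

order_of_12 = multiplicative_order(12, SIZE)   # 12 should generate (Z/257Z)*
inverse_check = (12 * 150) % SIZE              # 150 is the inverse of 12 mod 257

# 256 applications return a few sample unit vectors to themselves ...
returns_after_256 = all(iterate_linear_layer(unit(i), 256) == unit(i) for i in (0, 1, 2, 100))
# ... while 128 applications do not act as the identity on every unit vector.
moved_after_128 = any(iterate_linear_layer(unit(i), 128) != unit(i) for i in range(SIZE))
```

Only a handful of unit vectors are tested for the 256-fold iteration to keep the run time short; checking all 257 of them would establish the identity on the whole state space.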

A string sequence M[[n]] gives rise to a sequence of absorb calls, that each split the string M[i] into 8-bit blocks, pad these blocks with 10* and then sequentially inject these into the state in a series of duplex calls. The borders between the strings in the sequence are marked by the fact that the last block of each string M[i] before padding is shorter than 8 bits. If the strings M[i] are bit strings, an adversary can hence freely choose 9 bits between two consecutive duplex calls, as there is a blank round between each absorbing phase in unkeyed mode. In the case of byte strings, the adversary can choose from $2^8 + 1$ input values, reducing the number of freely chosen bits per duplex call to $8 + \epsilon$ with ε small. Our ambition is to have 112-bit security strength in the more general case of bit strings.

When generating a state collision, this may occur between string sequences M[[n]] and M′[[n′]] that give rise to sequences of duplex calls of equal length or of different length. For duplex call sequences of equal length, we can try to find a differential from the input M[[n]] to a zero difference in the state with a high differential probability. For duplex call sequences of different length, one could try to generate a fixed point. Those properties are discussed respectively in Section 3.4.3 and Section 3.4.2. Finally, one can try to find state collisions with a generic attack by just randomly trying inputs and counting on the birthday bound for collisions to occur. This is the starting point of the attack explained in the following subsection.

3.4.1 Advanced inner collisions

In absorbing, we call the bit positions where the input is injected the outer part of the state. The inner part of the state is formed by the other bit positions. In unkeyed absorbing, the outer part consists of 9 bit positions and the inner part 248 bit positions. In keyed absorbing, the outer part is 33 bits wide and the inner part 224 bits.

In a naive version of the birthday attack, we need to try about $2^{(257+1)/2} = 2^{129}$ inputs to find a collision in the state. This can be reduced to about $2^{124}$ inputs if we relax the state collision requirement somewhat, by only requiring a collision in the 248-bit inner part of the state. We call this an inner collision. An inner collision can readily be converted into a state collision by compensating for the (possible) difference in the 9-bit outer part by choosing the last blocks in the inner-state colliding inputs. The expected number of string sequences that must be tried before an inner collision presents itself is about $2^{125}$, taking about $2^{125+1} = 2^{126}$ duplex calls as there are two rounds per 8-bit block. This is the expected workload of a generic attack and decreasing it by a factor $2^{12}$ by exploiting specifics of the round function would break our security claim for Subterranean-XOF. As the round function of Subterranean is rather sparse and has a degree of only 2, it is not unthinkable that this could be done.

Exploiting specificities of the round function we did, and found a state-collision finding attack that takes roughly $2^{116+1}$ duplex calls. It is an attack on a weakened variant of Subterranean, where during unkeyed absorbing an input block is absorbed every round. This attack is exactly the reason why we chose to reduce this to one block every two rounds in unkeyed absorbing.

In a first phase of the attack, the birthday phase, we compute the states s obtained by absorbing many random input messages and assemble them in what we call the birthday set. It is the size of this set that determines the attack complexity. In a second phase of the attack, we identify pairs of states (s, s′) in the birthday set that form what we call an advanced inner collision. To form an advanced inner collision, the state values of the pair must satisfy certain equations and for a number of equations involving bits of (s, s′) a solution must exist. The generic scheme of the attack is depicted in Figure 2.

[Figure 2: Finding state collisions in unkeyed absorbing. The diagram shows random blocks followed by chosen blocks $m_{-2}$, $m_{-1}$, $m_0$ absorbed through successive rounds R, with intermediate state s and final state $s^0$. We want a collision on $s^0$; bits in blue are input block bits, only their difference matters. Bits in red represent conditions that can be satisfied by choosing the value of $m_{-1}$, where the $m_i$ are message blocks concatenated with the padding bit.]

We will now derive the equations the bits of an advanced inner collision must satisfy. This allows us to estimate the required size of the birthday set and hence the attack complexity. We denote the value of the state(s), right after absorbing the last message blocks $m_0$ and $m'_0$, by $s^0$ and index the iteration backwards with −1, −2 etc. Our equations are in the bits of $s^{-1}$ and $s'^{-1}$ and for readability we will abbreviate these to simply s and s′.

For the message difference $m_0 \oplus m'_0$, we can choose 9 bits (in blue in Figure 2). This is equivalent to the statement that for all $1 \le j \le 248$, $q_j(s) = q_j(s')$, where the $q_j$ are quadratic functions defined by the round function R and s and s′ are state values in the birthday set. In other words, we express bits in $s^0$ as functions of the state in s.

An attacker can control 9 bits in both states s and s′. We denote bits of $m_{-1}$ by $b_0, b_1, \ldots, b_7, b_8$ and bits of $m'_{-1}$ by $b'_0, b'_1, \ldots, b'_7, b'_8$. Those bits are injected at positions 1, 176, 136, 35, 249, 134, 197, 234, 64. Each bit $b_i$ (or $b'_i$), for $0 \le i \le 8$, will appear in 9 equations $q_j$, where the $q_j$ are the quadratic functions defined above. By doing a Gauss pivot on the 248 equations we can minimize the number of equations where $b_i$ or $b'_i$ appear.

We clarify this by explaining it in detail for the equations where bits $b_2$, $b'_2$, $b_5$ and $b'_5$ appear. Those bits are injected at positions 136 and 134, respectively. This yields equations of the form:

                      

$$
\begin{aligned}
q_{124}(s) + q_{124}(s') &= b_5 s_{133} + b'_5 s'_{133} \\
q_{125}(s) + q_{125}(s') &= b_5 s_{135} + b'_5 s'_{135} \\
q_{126}(s) + q_{126}(s') &= b_5 + b'_5 + b_2 s_{135} + b'_2 s'_{135} \\
q_{127}(s) + q_{127}(s') &= b_2 s_{137} + b'_2 s'_{137} \\
q_{128}(s) + q_{128}(s') &= b_2 + b'_2 \\
q_{129}(s) + q_{129}(s') &= b_5 s_{133} + b'_5 s'_{133} \\
q_{130}(s) + q_{130}(s') &= b_5 s_{135} + b'_5 s'_{135} \\
q_{131}(s) + q_{131}(s') &= b_5 + b'_5 + b_2 s_{135} + b'_2 s'_{135} \\
q_{132}(s) + q_{132}(s') &= b_5 s_{133} + b'_5 s'_{133} + b_2 s_{137} + b'_2 s'_{137} \\
q_{133}(s) + q_{133}(s') &= b_5 s_{135} + b'_5 s'_{135} + b_2 + b'_2 \\
q_{134}(s) + q_{134}(s') &= b_5 + b'_5 + b_2 s_{135} + b'_2 s'_{135} \\
q_{135}(s) + q_{135}(s') &= b_2 s_{137} + b'_2 s'_{137} \\
q_{136}(s) + q_{136}(s') &= b_2 + b'_2
\end{aligned}
$$

When we find a pair (s, s′) that has colliding birthday set coordinates, we try to find values of $b_i$ and $b'_i$ for i from 0 to 8 so that the first 26 equations are also satisfied. We now evaluate the probability that such values can be found. For any i different from 2 and 5, bits $b_i$ and $b'_i$ for different i occur in non-overlapping equations and there are 3 equations per couple $(b_i, b'_i)$. An exception to this are the equations in bits $b_2$, $b'_2$, $b_5$ and $b'_5$ that we will address afterwards. We will now work out the case for $b_0$ and $b'_0$ that corresponds to the first 3 equations.

  • If $s_0 = s'_0$ and $s_2 = s'_2$, then only the difference $b_0 + b'_0$ matters. This difference is uniquely determined by the first equation and this equation can be satisfied by choosing $b_0 + b'_0$. The next two equations are then both satisfied with probability $2^{-2}$.
  • If $s_0 = s'_0 + 1$ or $s_2 = s'_2 + 1$, then the difference $b_0 + b'_0$ matters for the first equation, but the absolute value also matters. This uniquely determines the value of $b_0$ and $b'_0$ by using only 2 equations. The last equation is then satisfied with probability $2^{-1}$.

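Hence, with the first case occurring with probability 1/4 and the second with probability 3/4, the first three equations can be satisfied with probability

$$\frac{1}{4} \times \frac{1}{4} + \frac{3}{4} \times \frac{1}{2} = \frac{7}{16}.$$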

This probability is the same for the equations where $b_1$, $b_3$, $b_4$, $b_6$, $b_7$ and $b_8$ appear. For the five equations where $b_2$, $b'_2$, $b_5$ and $b'_5$ appear, the previous analysis does not hold. To obtain the probability that these equations can be satisfied, we can look into the following 8 different events:

  • $s_{137} = s'_{137}$, $s_{135} = s'_{135}$, $s_{133} = s'_{133}$;
  • $s_{137} \ne s'_{137}$, $s_{135} = s'_{135}$, $s_{133} = s'_{133}$;
  • $s_{137} = s'_{137}$, $s_{135} \ne s'_{135}$, $s_{133} = s'_{133}$;
  • $s_{137} = s'_{137}$, $s_{135} = s'_{135}$, $s_{133} \ne s'_{133}$;
  • $s_{137} \ne s'_{137}$, $s_{135} \ne s'_{135}$, $s_{133} = s'_{133}$;
  • $s_{137} \ne s'_{137}$, $s_{135} = s'_{135}$, $s_{133} \ne s'_{133}$;
  • $s_{137} = s'_{137}$, $s_{135} \ne s'_{135}$, $s_{133} \ne s'_{133}$;
  • $s_{137} \ne s'_{137}$, $s_{135} \ne s'_{135}$, $s_{133} \ne s'_{133}$.

By using the same arguments as before, the probability of satisfying the five equations can be expressed as a sum over these eight events, each weighted by its probability $\frac{1}{8}$.

So given a pair $(s, s')$ with colliding birthday set coordinates, the probability that we can find trailing blocks $m_{-1}$, $m'_{-1}$ and a difference in $m_0$ that lead to a state collision is:

$$p' \approx 2^{-10}.$$

Hence, on average we would need to find about $2^{10}$ pairs with colliding birthday coordinates. In a birthday set of size $2^w$ the expected number of pairs that collide in birthday coordinates would be

$$\binom{2^w}{2} 2^{-221} \approx 2^{2w-1-221} = 2^{2w-222}.$$

Setting this to $2^{10}$ gives us the required size of the birthday set: $2^{116}$. Hence the computational complexity is roughly $2^{116}$ duplex calls.
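The required birthday set size can be recomputed numerically. This is a small sketch, assuming only the factor $2^{-221}$ and the target of about $2^{10}$ colliding pairs stated in the text; the function and constant names are hypothetical.

```python
from math import comb, log2

COLLISION_EXPONENT = 221  # pairs collide in birthday coordinates with prob. 2^-221
TARGET_LOG2 = 10          # about 2^10 colliding pairs are needed

def log2_expected_pairs(w):
    """log2 of the expected number of colliding pairs in a set of 2^w states."""
    return log2(comb(2 ** w, 2)) - COLLISION_EXPONENT

# Smallest w for which the expectation reaches 2^10 (small tolerance for rounding).
w = next(w for w in range(1, 200)
         if log2_expected_pairs(w) >= TARGET_LOG2 - 1e-9)
print(w)  # 116
```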

The complexity of this attack is probably dominated by storing the states in the hash table in their order of birthday set coordinates. Assuming $n \log(n)$ complexity and skipping over the constant factor, this would correspond to $116 \times 2^{116} \approx 2^{123}$ operations. Moreover, it would require enough memory to store $2^{116}$ states, each taking a few hundred bits. In order to reduce the birthday set in this attack on weakened Subterranean unkeyed absorbing to a size below $2^{112}$, an attacker would have to make use of sets of two trailing blocks $m_{-2}$ and $m_{-1}$ and one final difference. Equations would become of degree 4 and there would be many more equations involving the input bits. However, as the safety margin between the claimed security strength of 112 bits and the complexity of this attack is small, we think the reduction from injecting 9 bits every round to 9 bits every two rounds is justified. To modify this attack for the nominal unkeyed absorbing in Subterranean so that its birthday set would have size below $2^{112}$, an attacker would have to construct equations spanning 4 Subterranean rounds. We believe this to be infeasible.
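The sorting estimate is plain arithmetic on exponents. A short sketch, assuming base-2 logarithms and ignoring the constant factor as the text does:

```python
from math import log2

w = 116                      # log2 of the birthday set size n = 2^w
log2_cost = log2(w) + w      # log2(n * log2(n)) = log2(w) + w
print(round(log2_cost, 2))   # 122.86
```

The result is just under 123, consistent with the rounded figure of $2^{123}$ operations.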

3.4.2 Fixed points

A fixed point consists of a state value $s$, reached after absorbing a first sequence of input blocks, and a second sequence of input blocks such that after absorbing that second sequence with duplex calls the state again has value $s$. Such a fixed point would allow generating an infinite set of colliding input sequences. Finding such a fixed point with a generic attack has the same complexity as generically finding a state collision and we believe there are no shortcut attacks that would reduce the expected complexity by a factor $2^8$.

3.4.3 Differential properties

For duplex call sequences of equal length one may try to generate an inner collision by exploiting a differential or trail with high differential probability (DP). As an inner collision must be obtained in 248 bits of the state and the adversary can choose 9 bits per duplex call in each of the two input block sequences, it is unlikely that, starting from some given state, there exist colliding input sequences of less than $248 / (2 \cdot 9) \approx 14$ blocks. Clearly, there are $2^{9 \cdot 14} = 2^{126}$ input block sequences, and trying them all would just be a generic attack. Doing this in fewer calls in a systematic way would require controlling the propagation of the difference through the rounds and hence having some kind of high-probability differential in Subterranean from the input blocks to the state. We believe such differentials simply do not exist.
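The counting argument can be reproduced as follows. This is a minimal sketch using only the numbers from the paragraph above; the constant names are illustrative.

```python
from math import ceil

INNER_BITS = 248     # bits in which the inner collision must be obtained
BITS_PER_CALL = 9    # input bits the adversary chooses per duplex call
SEQUENCES = 2        # two input block sequences

# Minimum number of blocks so that the two sequences offer 248 bits of freedom.
min_blocks = ceil(INNER_BITS / (SEQUENCES * BITS_PER_CALL))
# log2 of the number of input block sequences of that length.
log2_sequences = BITS_PER_CALL * min_blocks
print(min_blocks, log2_sequences)  # 14 126
```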

3.5 State-recovery attacks

Subterranean-SAE is very similar to the CAESAR competition candidate Ketje Jr [10,11] that was attacked last year [23]. This attack is a state recovery attack on a weakened version of Ketje Jr, where the weakening consists of an increase of the rate during the wrap calls from the nominal 16 bits to 32 bits. The attack focuses on 4 consecutive rounds of Ketje Jr v1. The feasibility of the attack strongly depends on the bit positions of the outer part. In Ketje Jr the outer part covers full (5-bit) rows and the nonlinear mapping operates at row level. This means that if in a state at the input (resp. output) of χ all bits of a row are known, one can compute the bits in that row at the output (resp. input) of χ. This fact allows an attacker to link the information between 4 consecutive rounds. In Ketje Jr v2 the definition of the outer part was changed so that it no longer contains full rows, greatly reducing the applicability of the attack.

So when combining bits of $z^{t+1}$ with bits of $z^t$, in the expression of each bit of $s^{t+1}$ there is always one bit with position $i \in 12 G_{64}$ (namely $s^t_i$), while the bits of $s^t$ will be in positions in $G_{64}$. A bit of $z^{t+2}$ expressed in bits of $s^t$ will contain at least one bit in position $i \in 144 G_{64}$. Finally, the expression of a bit of $z^{t+3}$ will contain at least one bit in position $i \in 176 G_{64}$. During squeezing, the attacker knows 32 sums of 2 state bits each, taken from positions in the multiplicative subgroup generated by 176. Hence, the attacker knows the value of $s^{t+1}_{176^i} + s^{t+1}_{-176^i}$. But all those bits belong to $G_{64}$, and we know that they can be expressed as a function of $s^t$ (see equation 3). We have that for all $0 \le i \le 63$,

$$s^{t+1}_{176^i} + s^{t+1}_{-176^i} = s^t_{12 \cdot 176^i} + s^t_{-12 \cdot 176^i} + q(s^t).$$

By construction, we know that $12 \cdot 176^i$ and $-12 \cdot 176^i$ do not belong to $G_{64}$. Those bits then remain unknown and their presence guarantees that linear biases on the keystream can only be found for masks $U$ spanning more rounds. More precisely, any mask $U$ for which $U^T Z$ would exhibit non-zero bias has a span of at least 4 blocks. We believe this eliminates measurable bias in $Z$.
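The statement about positions outside $G_{64}$ can be checked by direct computation modulo 257. The sketch below assumes that bit positions are indices modulo 257 (the state size) and that $G_{64}$ is the multiplicative subgroup generated by 176, as in the text; the helper name is hypothetical.

```python
P = 257  # state size; bit positions are taken modulo 257

def generated_subgroup(g, p=P):
    """Multiplicative subgroup of the integers modulo p generated by g."""
    elems, x = set(), 1
    while x not in elems:
        elems.add(x)
        x = (x * g) % p
    return elems

G64 = generated_subgroup(176)
# Positions 12*176^i and -12*176^i for all i.
shifted = {(12 * g) % P for g in G64} | {(-12 * g) % P for g in G64}

print(len(G64))                  # 64
print(pow(176, 32, P) == P - 1)  # True: -1 is in G64, giving 32 pairs {x, -x}
print(G64.isdisjoint(shifted))   # True: none of the shifted positions is in G64
```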

3.7 Time-Memory-Data Trade-offs

When squeezing an output, Subterranean behaves like a stream cipher and hence it may be subject to time-memory trade-off attacks as specified in [3]. Those attacks can recover the internal state given resources $M = T = N^{1/2}$, with $M$ the memory complexity, $T$ the time complexity and $N$ the size of the stream cipher state space. As $N = 2^{257}$ for Subterranean, these attacks are not a threat to our claimed security strength of 128 bits. In 2000, Biryukov and Shamir [13] improved the trade-off by bringing the data complexity $D$ and the computation in the pre-processing phase $P$ into the equation. The invariants of their trade-off become $TP = N$ and $MD = N$. As we limit the data complexity to $D < 2^{96}$ and have $N = 2^{257}$, this still does not jeopardize the claimed security strength of 128 bits.
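The margins can be made explicit by working with exponents. This is a rough sketch, reading the first trade-off as $M = T = N^{1/2}$ and taking the relation $MD = N$ exactly as stated in the text; the constant names are illustrative.

```python
LOG2_N = 257         # log2 of the state space size N
CLAIMED_BITS = 128   # claimed security strength in bits
LOG2_D_MAX = 96      # data limit: D < 2^96

# First trade-off: M = T = N^(1/2).
log2_T = LOG2_N / 2
print(log2_T, log2_T > CLAIMED_BITS)          # 128.5 True

# With the data limit, M*D = N gives a lower bound on the memory M.
log2_M_min = LOG2_N - LOG2_D_MAX
print(log2_M_min, log2_M_min > CLAIMED_BITS)  # 161 True
```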

4 Implementation of Subterranean

We made three software implementations and one hardware/software co-design for FPGAs or ASICs. The software implementations are:

  • reference code in C,
  • a clone of the reference code in Python,
  • memory-compact code in C.

The hardware/software co-design uses the memory-compact code for the mode layers and a Verilog implementation of the Subterranean permutation and some I/O management. In the following subsections we discuss these implementations.

4.1 Software code

The reference code in C stores each state bit in a byte, which makes it easier to handle the π step of the round function and to perform input injection and output extraction. This code is very close to the specifications in Algorithms 1, 2, 3 and 4 and is therefore easier to understand and debug. It does diverge in the way the absorb function is structured: for understandability, this function is split along the 4 options unkeyed, keyed, encryption and decryption.

We wrote a clone of the reference code in Python. We used this code for debugging and now make it available. In this code the state is stored in a list of integers, where each integer is a state bit. Like the C reference code, this code is meant for understanding the Subterranean 2.0 suite, and not for performance.

4.1.1 Memory-compact code

As a proof of concept, we implemented a version of the Subterranean 2.0 suite that packs sets of 8 state bits in bytes, thus showing it is possible to work with less memory. This byte-packing requires all the bit-oriented transformations to be done with bit masks and shifts. While the low-level state handling functions are quite different from the reference code, the mode-level functions for XOF, deck and SAE are quite close to those in the reference C code and Algorithms 2, 3 and 4. However, in the absorb and duplex functions the architecture is different from the one described in Algorithm 1. This is because in the memory-compact code all the state handling operations are done in the duplex and extract calls, which gives a clear separation of the functions that handle the state. This separation makes the hardware/software co-design easier, because the hardware will be the one handling the state directly. Finally, the code only accepts inputs whose length is a multiple of a byte, as the NIST submission interface only accepts such inputs. While the memory-compact code will work on any architecture that can process bytes, writing code for words of 32 or 64 bits is still open.
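The difference between the one-bit-per-element state of the reference code and a byte-packed state can be illustrated as follows. This is only a sketch of the general masks-and-shifts technique, not the code of the submission: the helper names and the bit order within a byte (least significant bit first) are assumptions.

```python
STATE_BITS = 257
STATE_BYTES = (STATE_BITS + 7) // 8  # 33 bytes for the packed state

def get_bit(packed, i):
    """Read state bit i from the packed state (assumed LSB-first in each byte)."""
    return (packed[i >> 3] >> (i & 7)) & 1

def set_bit(packed, i, value):
    """Write state bit i in the packed state using a bit mask."""
    mask = 1 << (i & 7)
    if value:
        packed[i >> 3] |= mask
    else:
        packed[i >> 3] &= ~mask & 0xFF

def pack(bits):
    """Convert a one-bit-per-element list (reference style) to packed bytes."""
    packed = bytearray(STATE_BYTES)
    for i, b in enumerate(bits):
        set_bit(packed, i, b)
    return packed

def unpack(packed):
    """Convert packed bytes back to a one-bit-per-element list."""
    return [get_bit(packed, i) for i in range(STATE_BITS)]

bits = [int((i * i) % 3 == 1) for i in range(STATE_BITS)]  # arbitrary test pattern
assert unpack(pack(bits)) == bits
print(len(bits), len(pack(bits)))  # 257 33
```

The packed form needs 33 bytes instead of 257, at the cost of a shift and a mask on every bit access.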

4.2 Hardware architecture

It makes sense to implement the Subterranean duplex object as a hardware core, and this can be done in different ways. The entire Subterranean 2.0 suite, including XOF, deck and SAE, could be done as a hardware core with an internal buffer and state machine, or as a hardware/software co-design. In the co-design strategy the serial and communication tasks are usually done by software on a CPU, and the computation tasks themselves are done in the hardware core. The hardware/software co-design could also be integrated in a SoC, with the state handling operations done in a special circuit and the communication and serial tasks done in software. As a proof of concept, we designed a hardware circuit that only handles the state, and software based on the memory-compact code that handles the communication. To make this clearer, we split the Subterranean algorithm operations into the ones that handle the state and the ones that do not.

In the hardware implementation we adopted a different interpretation of the low-level Subterranean duplex object from Algorithm 1, as (partly) defined in Algorithm 5. In this alternative definition, duplex is subdivided into duplexSimple, duplexEncrypt and duplexDecrypt, while extract becomes squeezeSimple. In the alternative definitions, the padding responsibility is given to the function caller, while some extra functionality is given to duplex and extract. The duplexEncrypt and duplexDecrypt functions also perform the output extraction and input injection inside the duplex function, while still not handling the padding. The squeezeSimple function performs the extract and the blank duplex call of Algorithm 1. We made this separation to avoid the conditional execution of the original duplex and to have in the hardware core only simple functions that handle the state. The memory-compact software implementation follows a similar approach, except that the padding is handled by the inner functions instead of the caller.

Our hardware/software co-design architecture is split into 4 parts: the permutation round, the registers with a simple interface, the AXI4-Lite slave interface [1] and the entire SoC system. Figure 3 shows the Subterranean round architecture. This architecture is basically the same as the one shown in Figure 1; the only difference is the addition of the input σ in the θ step. This is done because the 3 input XORs that are done in the θ step can be done

Figure 5: Subterranean simple encapsulated in an AXI4-Lite slave interface.

the Xilinx Zynq SoC, which is an ARM SoC with external peripherals such as Ethernet, USB and UART, and a Xilinx FPGA, interconnected by different buses, most importantly an AXI4 bus. Having the AXI4 bus interconnect the FPGA and the CPU helps to test and evaluate IP designs that are meant to be used on an AXI4 bus. Figure 6 shows how we interconnected the ARM CPU and the Subterranean core. The Zynq has a central interconnect which connects the SoC peripherals, the ARM CPU and the FPGA; there are other connections, such as interrupts, which are not shown. In our case the CPU is the master of the AXI4 communication, which is then connected to the AXI4 slave port in the FPGA. Inside the FPGA, we need to instantiate an intermediate component, the AXI Interconnect. This extra component performs the conversion between the AXI4 and the AXI4-Lite protocols. If we implemented Subterranean with an AXI4 interface, we could connect it directly to the central interconnect in the chip. However, the AXI4 interface is quite complex, so this is left for another implementation.

4.3 Hardware results

The circuits described above were written in Verilog and tested on the Zedboard with the Zynq SoC from Xilinx. The tests applied the KAT produced by the software implementation in C, which are compared with the reference software KAT. The hardware/software co-design was done with the Vivado 2017.4 tool; the software runs on the Zynq ARM CPU, which sends the hardware commands through the AXI interface in the FPGA. This implementation needed 298 slices, which can be split into 763 LUTs and 877 flip-flops. The FPGA was set at 200 MHz (5 ns period), but our circuit could theoretically operate at up to 217 MHz.

Figure 6: Subterranean core in the Xilinx Zynq FPGA. The communication is bidirectional; the arrows point from communication master to slave.

We also synthesized the same circuit in ASIC cells with the open FreePDK 45nm [27] and the open source tool Yosys 0.8 [29]. Yosys does not perform the entire ASIC development stack, but gives results only for the gates necessary to implement the circuit (therefore no wiring is taken into account). It also does not have a fully timing-oriented synthesis, which would try different combinations and optimizations until the timing requirements are met. However, this is enough to obtain an estimate of the resources and of the possible circuit delay.

The results for the ASIC cells are summarized in Table 1. It shows that by adding the registers plus duplex logic, the total area doubled. This is easy to verify knowing that a flip-flop with no resets/sets occupies 4.25 GE, while a NAND gate is only 1 GE [27]. Therefore, serial architectures are strongly discouraged in this scenario, since the registers are the biggest bottleneck and reducing the round logic will give only minor gains. However, a circuit that performs 2, 3, 4 or more rounds in one cycle might be preferable, since it would give some latency gains while performing XOF computations. Table 1 also shows the critical path results for this technology. Because the round function does not have deep dependencies, the results are less than 1 ns. Thus adding a register between the computations might not be interesting in this case. In an extreme scenario where someone needs to reduce the critical path, registers could be added after the ι step. Inserting the AXI4-Lite interface into our circuit increased the amount of resources by approximately 16%. While this is a reasonable amount, a full AXI interface would increase this number further. Therefore, designers should also take into account the communication and the environment in which the solution will be used.

Table 1: FreePDK 45nm [27] area and delay results for Subterranean duplex.

| | Area (μm^2) | Area (GE) | Critical path (ns) |
|---|---|---|---|
| Round logic (Fig. 3) | 4602 | 2452 | 0. |
| Duplex logic (Fig. 4) | 9161 | 4880 | 0. |
| AXI4 Lite (Fig. 5) | 10655 | 5676 | 0. |
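The ratios quoted in the text can be recomputed from the area columns of Table 1. The following sketch uses only the table values, the 4.25 GE flip-flop figure and the 257-bit state size; the dictionary keys are illustrative labels, and the flip-flop estimate covers the state register alone.

```python
AREA_UM2 = {"round": 4602, "duplex": 9161, "axi4_lite": 10655}
AREA_GE = {"round": 2452, "duplex": 4880, "axi4_lite": 5676}

# Area of one gate equivalent implied by each row (should agree across rows).
um2_per_ge = {k: AREA_UM2[k] / AREA_GE[k] for k in AREA_GE}

# Growth when adding registers and duplex logic to the round logic.
duplex_over_round = AREA_GE["duplex"] / AREA_GE["round"]

# Relative overhead of the AXI4-Lite interface over the duplex logic.
axi_overhead = AREA_GE["axi4_lite"] / AREA_GE["duplex"] - 1

# Area of 257 state flip-flops at 4.25 GE each.
state_ff_ge = 257 * 4.25

print(round(duplex_over_round, 2))  # 1.99
print(round(axi_overhead, 2))       # 0.16
print(state_ff_ge)                  # 1092.25
```

All three rows imply the same area per gate equivalent to within rounding, which suggests the GE column was derived from the μm^2 column with a single conversion factor.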

4.4 Reference software results

Even though our reference code is meant as a guide for testing and understanding, we still evaluate it in terms of memory consumption and timing. All the results were obtained on a virtual machine running on an Intel Core i5-4570 at 3.2 GHz with Hyper-Threading enabled and Windows 7. The virtual machine is configured with 16 GB of RAM, 2 CPUs and