Alan Turing and the Other Theory of Computation (expanded)*

Lenore Blum

Computer Science Department, Carnegie Mellon University

Abstract. We recognize Alan Turing’s work in the foundations of numerical computation (in particular, his 1948 paper “Rounding-Off Errors in Matrix Processes”), its influence in modern complexity theory, and how it helps provide a unifying concept for the two major traditions of the theory of computation.

1 Introduction

The two major traditions of the theory of computation, each staking claim to similar motivations and aspirations, have for the most part run a parallel non-intersecting course. On one hand, we have the tradition arising from logic and computer science addressing problems with more recent origins, using tools of combinatorics and discrete mathematics. On the other hand, we have numerical analysis and scientific computation emanating from the classical tradition of equation solving and the continuous mathematics of calculus. Both traditions are motivated by a desire to understand the essence of computation, of algorithm; both aspire to discover useful, even profound, consequences.

While the logic and computer science communities are keenly aware of Alan Turing’s seminal role in the former (discrete) tradition of the theory of computation, most remain unaware of Alan Turing’s role in the latter (continuous) tradition, this notwithstanding the many references to Turing in the modern numerical analysis/computational mathematics literature, e.g., [Bür10, Hig02, Kah66, TB97, Wil71]. These references are not to recursive/computable analysis (suggested in Turing’s seminal 1936 paper), usually cited by logicians and computer scientists, but rather to the fundamental role that the notion of “condition” (introduced in Turing’s seminal 1948 paper) plays in real computation and complexity.

*This paper amplifies a shorter version to appear in The Selected Works of A.M. Turing: His Work and Impact, Elsevier [Blu12] and follows the perspective presented in “Computing over the Reals: Where Turing Meets Newton,” [Blu04].

In 1948, in the first issue of the Quarterly Journal of Mechanics and Applied Mathematics, sandwiched between a paper on “Use of Relaxation Methods and Fourier Transforms” and “The Position of the Shock-Wave in Certain Aerodynamic Problems,” appears the article “Rounding-Off Errors in Matrix Processes.” This paper introduces the notion of the condition number of a matrix, the chief factor limiting the accuracy in solving linear systems, a notion fundamental to numerical computation and analysis, and a notion with implications for complexity theory today. This paper was written by Alan Turing [Tur48].

After the war, with the anticipation of a programmable digital computing device on the horizon, it was of great interest to understand the comparative merits of competing computational “processes” and how accurate such processes would be in the face of inevitable round-off errors. Solving linear systems is basic. Thus for Turing (as it was for John von Neumann [vNG47]), examining methods of solution with regard to the ensuing round-off errors presented a compelling intellectual challenge.^1

^1 It is clear that Turing and von Neumann were working on similar problems, for similar reasons,

Wilkinson also credits Turing for converting him from a classical to a numerical analyst. From 1946 to 1948, Wilkinson worked for Turing at the NPL on the logical design of Turing’s proposed Automatic Computing Engine and the problem of programming basic numerical algorithms:

The period with Turing fired me with so much enthusiasm for the computer project and so heightened my interest in numerical analysis that gradually I abandoned [the idea of returning to Cambridge to take up research in classical analysis].

Here I would like to recognize Alan Turing’s work in the foundations of numerical computation. Even more, I would like to indicate how this work has seeded a major direction in complexity theory of real computation and provides a unifying concept for the two major traditions in the theory of computation.

2 Rounding-Off Errors in Matrix Processes

This paper contains descriptions of a number of methods for solving sets of linear simultaneous equations and for inverting matrices, but its main concern is with the theoretical limits of accuracy that may be obtained in the application of these methods, due to round-off errors.

So begins Turing’s paper [Tur48]. (Italics are mine; I’ll return to this shortly.) The basic problem at hand: Given the linear system Ax = b, where A is a real non-singular n × n matrix and b ∈ R^n, solve for x ∈ R^n. Prompted by calculations [FHW48] challenging the arguments by Harold Hotelling [Hot43] that Gaussian elimination and other direct methods would lead to exponential round-off errors, Turing introduces quantities not considered earlier to bound the magnitude of errors, showing that for all “normal” cases, the exponential estimates are “far too pessimistic.”^4

^4 In their 1946 paper, Valentine Bargmann, Deane Montgomery and von Neumann [BMvN63] also dismissed Gaussian elimination as likely being unstable due to magnification of errors at successive stages (pp. 430-431) and so turned to iterative methods for analysis. However, in 1947 von Neumann and Herman Goldstine reassess [vNG47], noting, as does Turing, that it is the computed solution, not the intermediate computed numbers, which should be the salient object of study. They re-investigated Gaussian elimination for computing matrix inversion and now give optimistic error bounds similar to those of Turing, but for the special case of positive definite symmetric matrices. Turing in his paper notes that von Neumann communicated these results to him at Princeton [during a short visit] in January 1947, before his own proofs were complete.

In this paper, Turing introduced the notion of the condition number of a matrix, making explicit for the first time a measure that helps formalize the informal notion of ill- and well-conditioned problems.^5

3 The Matrix Condition Number: Where Turing Meets Newton

When we come to make estimates of errors in matrix processes, we shall find that the chief factor limiting the accuracy that can be obtained is ‘ill-conditioning’ of the matrices involved [Tur48].

Turing provides an illustrative example:

−1.4x + 0.9y = −2.7
0.8x + 1.7y = 1.2        (8.1)

0.786x + 1.709y = 1.173
0.800x + 1.700y = 1.200  (8.2)

The set of equations (8.1) is fully equivalent to (8.2)^6 but clearly if we attempt to solve (8.2) by numerical methods involving rounding-off errors we are almost certain to get much less accuracy than if we worked with equations (8.1). ...

We should describe the equations (8.2) as an ill-conditioned set, or, at any rate, as ill-conditioned compared with (8.1). It is characteristic of ill-conditioned sets of equations that small percentage errors in the coefficients given may lead to large percentage errors in the solution.
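To make the contrast concrete, here is a minimal numerical sketch (my own, not Turing's calculation) that perturbs every coefficient of both systems by a 0.1% relative error and records the worst relative error seen in the computed solution. The matrices use the signs as reconstructed above, and the perturbation size and trial count are illustrative choices only.

```python
# Solve Turing's two equivalent systems under small random coefficient errors
# and compare how much the solution moves. A minimal sketch, not from the paper.
import numpy as np

A1 = np.array([[-1.4, 0.9], [0.8, 1.7]])     # system (8.1), as reconstructed above
b1 = np.array([-2.7, 1.2])
A2 = np.array([[0.786, 1.709], [0.8, 1.7]])  # system (8.2), same solution
b2 = np.array([1.173, 1.2])

rng = np.random.default_rng(0)

def worst_relative_error(A, b, rel=1e-3, trials=1000):
    """Largest relative solution error over random relative perturbations of size `rel`."""
    x = np.linalg.solve(A, b)
    worst = 0.0
    for _ in range(trials):
        dA = A * rel * rng.uniform(-1, 1, A.shape)
        db = b * rel * rng.uniform(-1, 1, b.shape)
        x_pert = np.linalg.solve(A + dA, b + db)
        worst = max(worst, np.linalg.norm(x_pert - x) / np.linalg.norm(x))
    return worst

print("relative solution error, system (8.1):", worst_relative_error(A1, b1))
print("relative solution error, system (8.2):", worst_relative_error(A2, b2))
# Typical behavior: (8.1) stays near the 0.1% input error, while (8.2)
# magnifies it by roughly two orders of magnitude.
```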

Turing defines two condition numbers (he calls them N and M), which in essence measure the intrinsic potential for magnification of errors. He then analyzes various standard methods for solving linear systems, including Gaussian elimination, and gets error bounds proportional to his measures of condition. Turing is “also as much interested

^5 In sections 3 and 4 of his paper, Turing also formulates the LU decomposition of a matrix (actually the LDU decomposition) and shows that Gaussian elimination computes such a decomposition. ^6 The third equation (in the set of four) is the second plus 0.01 times the first.

[Figure: a map φ sending the input x to the output φ(x), and the perturbed input x + Δx to φ(x + Δx).]

So let Δx be a small perturbation of input x and Δφ = φ(x + Δx) − φ(x). The limit as ∥Δx∥ goes to zero of the ratio ∥Δφ∥ / ∥Δx∥, or of the relative ratio (∥Δφ∥ / ∥φ(x)∥) / (∥Δx∥ / ∥x∥) (favored by numerical analysts), will be a measure of the condition of the problem instance.^9 If large, computing the output with small error will require increased precision, and hence, from a computational complexity point of view, increased time/space resources.

Definition.^10 The condition number of problem instance (φ, x) is defined by

κ̂(φ, x) = lim_{δ→0} sup_{∥Δx∥ ≤ δ} ∥Δφ∥ / ∥Δx∥,

and the relative condition number by

κ(φ, x) = lim_{δ→0} sup_{∥Δx∥ ≤ δ∥x∥} (∥Δφ∥ / ∥φ(x)∥) / (∥Δx∥ / ∥x∥).

If κ(φ, x) is small, the problem instance is said to be well-conditioned and if large, ill-conditioned. If κ(φ, x) = ∞, the problem instance is ill-posed.

^9 All norms are assumed to be with respect to the relevant spaces. ^10 Here I follow the notation in [TB97], a book I highly recommend for background in numerical linear algebra.
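As a quick sanity check on the definition, the following sketch (my own, not from the paper) estimates κ(φ, x) for a scalar map by sampling perturbations of shrinking relative size. The two test functions are illustrative choices: one well-conditioned everywhere, one suffering from cancellation.

```python
# Estimate the relative condition number of a scalar map directly from the
# definition, by sampling small perturbations. A minimal sketch.
import numpy as np

def relative_condition(phi, x, delta=1e-8, samples=101):
    """Approximate kappa(phi, x) = sup_{|dx| <= delta*|x|} (|dphi|/|phi(x)|) / (|dx|/|x|)."""
    fx = phi(x)
    best = 0.0
    for dx in np.linspace(-delta * abs(x), delta * abs(x), samples):
        if dx == 0.0:
            continue
        ratio = (abs(phi(x + dx) - fx) / abs(fx)) / (abs(dx) / abs(x))
        best = max(best, ratio)
    return best

print(relative_condition(np.sqrt, 2.0))               # ~0.5: well-conditioned
print(relative_condition(lambda x: x - 1.0, 1.0001))  # ~1e4: cancellation, ill-conditioned
```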

As Turing envisaged it, the condition number measures the theoretical limits of accuracy in solving a problem. In particular, the logarithm of the condition number provides an intrinsic lower bound for the loss of precision in solving the problem instance, independent of algorithm.^11 Thus it also provides a key intrinsic parameter for specifying “input word size” for measuring computational complexity over the reals —and in connecting the two traditions of computation— as we shall see in Section 6.

If φ is differentiable, then

κ̂(φ, x) = ∥Dφ(x)∥

and

κ(φ, x) = ∥Dφ(x)∥ (∥x∥ / ∥φ(x)∥),

where Dφ(x) is the Jacobian (derivative) matrix of φ at x and ∥Dφ(x)∥ is the operator norm of Dφ(x) with respect to the induced norms on X and Y.

Thus we have a conceptual connection between the condition number (Turing) and the derivative (Newton). Indeed, the following theorem says the matrix condition number κ(A) is essentially the relative condition number for solving the linear system Ax = b. In other words, the condition number is essentially the (normed) derivative.^12

Theorem.

1. Fix A, a real non-singular n × n matrix, and consider the map φ_A : R^n → R^n where φ_A(b) = A^{-1}b. Then κ(φ_A, b) ≤ κ(A), and there exist b such that κ(φ_A, b) = κ(A). Thus, with respect to perturbations in b, the matrix condition number is the worst case relative condition for solving the linear system Ax = b.
2. Fix b ∈ R^n and consider the partial map φ_b : R^{n×n} → R^n where, for A non-singular, φ_b(A) = A^{-1}b. Then for A non-singular, κ(φ_b, A) = κ(A).

So the condition number κ(A) indicates the number of digits that can be lost in solving the linear system. Trouble is, computing the condition number seems as hard as solving the problem itself. Probabilistic analysis can often be employed to glean information.

^11 Velvel Kahan points out that “pre-conditioning” can sometimes alter the given problem instance to a better conditioned one with the same solution. (Convert equations (8.2) to (8.1) in Turing’s illustrative example.) ^12 This inspired in part the title of my paper, “Computing over the Reals: Where Turing Meets Newton” [Blu04].
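A short sketch (mine, not the paper's) of the rule of thumb that log10 κ(A) bounds the decimal digits one can expect to lose; the Hilbert matrix is a stock ill-conditioned example added purely for illustration.

```python
# Relate the matrix condition number to decimal digits of accuracy at risk.
# A minimal sketch using numpy's built-in 2-norm condition number.
import numpy as np

def hilbert(n):
    i, j = np.indices((n, n))
    return 1.0 / (i + j + 1)

for A in (np.array([[-1.4, 0.9], [0.8, 1.7]]),    # Turing's (8.1), as reconstructed
          np.array([[0.786, 1.709], [0.8, 1.7]]), # Turing's (8.2)
          hilbert(6)):                            # a standard ill-conditioned matrix
    kappa = np.linalg.cond(A)                     # 2-norm condition number
    print(f"kappa = {kappa:10.3e}, decimal digits at risk ~ {np.log10(kappa):.1f}")
```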

It is convenient to have a measure of the amount of work involved in a computing process, even though it be a very crude one. … We might, for instance, count the number of additions, subtractions, multiplications, di- visions, recordings of numbers, …

This is the basic approach taken by numerical analysts, qualified, as Turing also implies, by condition and round-off errors. It is also the approach taken by Mike Shub, Steve Smale and myself in [BSS89], and later with Felipe Cucker in our book, Complexity and Real Computation [BCSS98]. See also [Blu90], [Sma97], and [Cuc02].

5 Complexity and Real Computation in the Spirit of Turing, 1948

From the late 1930’s to the 1960’s, a major focus for logicians was the classification of what was computable (by a Turing Machine, or one of its many equivalents) and what was not. In the 1960’s, the emerging community of theoretical computer scientists embarked on a more down-to-earth line of inquiry —of the computable, what was feasible and what was not— leading to a formal theory of complexity with powerful applications and deep problems, viz., the famous/infamous P = NP? challenge.

Motivated to develop analogous foundations for numerical computation, [BSS89] present a model of computation over an arbitrary field R. For example, R could be the field of real or complex numbers, or Z_2, the field of integers mod 2. In the spirit of Turing 1948, inputs to the so-called BSS machines are vectors over R and the basic algebraic operations, comparisons and admissible retrievals are unit cost. Algorithms are represented by directed graphs (or in more modern presentations, circuits) where interior nodes are labelled with basic operations, and computations flow from the input to output nodes. The cost of a computation is the number of nodes traversed from input to output.

As in the discrete case, complexity (or cost of a computation) is measured as a function of input word size. At the top level, input word size is defined as the vector length. When R is Z_2, the input word size is just the bit length, as in the discrete case. Complexity classes over R, such as P, NP and EXP, are defined in natural ways. When R is Z_2, the BSS theory of computation and complexity reduces to the classical discrete theory.
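The following toy sketch (my own, meant only to convey the unit-cost flavor of the model rather than the formal machine definition of [BSS89]) counts the nodes traversed by a small decision computation over R.

```python
# A toy, BSS-flavored computation over R: each arithmetic operation and each
# order comparison costs one unit, and the cost of a run is the node count.
from dataclasses import dataclass

@dataclass
class Machine:
    cost: int = 0
    def op(self, value):          # an arithmetic node: unit cost
        self.cost += 1
        return value
    def branch(self, predicate):  # a comparison node: unit cost
        self.cost += 1
        return predicate

def has_real_root(m, a, b, c):
    """Decide whether a*x^2 + b*x + c = 0 has a real root (assume a != 0)."""
    t1 = m.op(b * b)              # one multiplication
    t2 = m.op(4.0 * a)            # one multiplication (the constant 4 is built in)
    t3 = m.op(t2 * c)             # one multiplication
    disc = m.op(t1 - t3)          # one subtraction
    return m.branch(disc >= 0.0)  # one order comparison

m = Machine()
print(has_real_root(m, 1.0, 1.0, 1.0), "cost =", m.cost)   # False, cost = 5
m = Machine()
print(has_real_root(m, 1.0, -3.0, 2.0), "cost =", m.cost)  # True, cost = 5
```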

The problem of deciding whether or not a finite system of polynomial equations over R has a solution over R, so fundamental to mathematics, turns out to be a universal NP-complete problem [BSS89]. More precisely, for any field (R, =), or real closed field (R, <), instances of NP-complete problems over R can be coded up as polynomial systems such that an instance is a “yes” instance if and only if the corresponding polynomial system has a solution over R.^13 We call this problem the Hilbert Nullstellensatz over R, or HN_R.

There are many subtleties here. For example, the fact that NP ⊆ EXP over Z_2 is a simple counting argument on the number of possible witnesses. Over the reals or complexes, there are just too many witnesses. Indeed, it’s not a priori even clear that in those cases, NP problems are decidable. Decidability in those cases follows from the decidability of HN_R by Alfred Tarski [Tar51], and membership in EXP from Jim Renegar’s exponential-time decision algorithms [Ren88a].

New challenges arise: Does P = NP over the reals or complex numbers, or equivalently, is HN_R ∈ P over those fields? And what is the relation between these questions and the classical P vs. NP challenge?

In an attempt to gain new insight or to access more tools, mathematicians often position hard problems within new domains. It is thus tempting to speculate whether tools of algebraic geometry might have a role to play in studying classical complexity problems. Salient transfer results: If P = NP over the complex numbers, then BPP ⊇ NP over Z_2 [CSS94]. And for algebraically closed fields of characteristic 0, either P = NP for all, or for none [BCSS96].

We shall return to this discussion in Section 9, but first we introduce condition into the model.

^13 The notation (R, =) denotes that branching in BSS machines over R is decided by equality comparisons, while (R, <) indicates that R is an ordered field and branching is decided by order comparisons. When we talk about computing over the complex numbers or Z_2, we are supposing our machines branch in the former sense, while over the real numbers, we mean the latter.

7 The Condition Number Theorem Sets the Stage

Let Σ_n be the variety of singular (ill-posed) n × n matrices over R, i.e.,

Σ_n = {A ∈ R^{n×n} | A is not invertible}.

We might expect that matrices close to Σ_n would be ill-conditioned while those at a distance, well-conditioned. That is what the Condition Number Theorem says. It provides a geometric characterization of the matrix condition number which suggests, for other computational problems, how condition could be measured.

[Figure: the variety Σ_n of singular matrices and a matrix A at some distance from it.]

The Condition Number Theorem.

κ(A) = ∥A∥ / dist(A, Σ_n)

Here dist(A, Σ_n) = inf{∥A − B∥ | B ∈ Σ_n}, where dist is measured with respect to the operator norm or the Frobenius norm. (The Frobenius norm is given by ∥A∥_F = (Σ_{i,j} a_{ij}²)^{1/2}, where A = [a_{ij}].)
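For the operator 2-norm the theorem can be checked numerically, since (by the Eckart–Young result discussed below) the distance from A to the nearest singular matrix is the smallest singular value. A minimal sketch, assuming nothing beyond standard numpy:

```python
# Verify kappa(A) = ||A|| / dist(A, Sigma_n) in the 2-norm for a random matrix.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))

sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending
norm_A = sigma[0]                            # operator 2-norm of A
dist_to_singular = sigma[-1]                 # 2-norm distance to Sigma_n

print(np.linalg.cond(A))                     # kappa(A) in the 2-norm
print(norm_A / dist_to_singular)             # ||A|| / dist(A, Sigma_n): same value
```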

The Condition Number Theorem is a re-interpretation of a classical theorem of Carl Eckart and Gale Young [EY36]. Although it was published in 1936, Turing and von Neumann seem not to have been aware of it. Velvel Kahan [Kah66], and later his student Jim Demmel [Dem87], were the first to exploit it, connecting condition with distance to ill-posedness.

Jim Renegar, inspired by the Condition Number Theorem, answers our query on how to measure the condition of a linear program, and then uses this measure as a key parameter in the complexity analysis of his beautiful algorithm [Ren88b, Ren95a, Ren95b].^16

of condition, as well as other intrinsic parameters, and to study their relationship to computational complexity.” ^16 Jim tells me that he started thinking about this after a talk I gave at MSRI during the 1985- Computational Complexity Year [Ren88b].

Recall, the linear programming problem (A, b, c) is to maximize c^T x such that Ax ≥ b. Here A is a real m × n matrix, b ∈ R^m and c ∈ R^n. Let (A, b) denote the above inequalities, which also define a polytope in R^n. In the following, we assume that m ≥ n. We call

Σ_{m,n} = {(A, b) | (A, b) is on the boundary between the feasible and the infeasible}

the space of ill-posed linear programs.

Let C_P(A, b) = ∥(A, b)∥_F / dist_F((A, b), Σ_{m,n}). Here ∥·∥_F is the Frobenius norm, and dist_F is measured with respect to that norm. Similarly, define C_D(A, c) for the dual program.

Definition. The Renegar condition number [Ren95b] for the linear program (A, b, c) is given by C(A, b, c) = max[C_P(A, b), C_D(A, c)].

Renegar’s algorithm for the LPP [Ren88b] imagines that each side of a (bounded, non-empty) polytope, given by inequalities, exerts a force inward, yielding the “center of gravity” of the polytope. Specifically, the center σ is gotten by maximizing

∑_{i=1}^m ln(α_i · x − b_i),

where α_i is the ith row vector of the matrix A, and b_i is the ith entry in b.

In essence, the algorithm starts at this initial center of gravity σ. It then follows a path of centers (approximated by Newton) generated by adding a sequence of new inequalities c · x ≥ k^(j) (k^(j), j = 1, 2, ..., chosen so that each successive new polytope is bounded and non-empty). Let k^(1) = c · σ. If the new center is σ^(1), then k^(2) is chosen so that k^(1) ≤ k^(2) ≤ c · σ^(1). And so on.

Conceptually, the hyperplane c · x = k^(1) is successively moved in the direction of the vector c, pushing the initial center σ towards the optimum.

[Figure: the hyperplane c · x = k being moved in the direction of the vector c.]
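The following bare-bones sketch (my own toy, not Renegar's actual algorithm) computes such a barrier "center of gravity" for a tiny polytope by a few damped Newton steps on the sum of logarithms; the triangle and all numerical choices are invented for illustration.

```python
# Approximate the maximizer of sum_i ln(alpha_i . x - b_i) over {x : Ax >= b}
# by damped Newton steps, backtracking to stay strictly feasible.
import numpy as np

def center(A, b, x, iters=50):
    """Barrier center of {x : Ax >= b}, starting from a strictly feasible x."""
    for _ in range(iters):
        s = A @ x - b                        # slacks, must stay > 0
        grad = A.T @ (1.0 / s)
        hess = -A.T @ (A / s[:, None] ** 2)  # Hessian of the log-barrier sum
        step = np.linalg.solve(hess, -grad)  # Newton direction (ascent, f concave)
        t = 1.0
        while np.min(A @ (x + t * step) - b) <= 0:   # backtrack to stay feasible
            t *= 0.5
        x = x + t * step
    return x

# The triangle x >= 0, y >= 0, x + y <= 1, written in the form Ax >= b.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, -1.0])
print(center(A, b, np.array([0.25, 0.25])))   # approaches (1/3, 1/3)
```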

Taken together, the two parts give a probability bound on the expected running time of the algorithm, eliminating the condition μ.

So, how to estimate the (tail) probability that the condition is large?

Suppose (by normalizing) that all problem instances live within the unit sphere. Suppose Σ is the space of ill-posed instances and that condition is inversely proportional to “distance” to ill-posedness. Then the ratio of the volume of a thin tube about Σ to the volume of the unit sphere provides an estimate of the probability that the condition is large. To calculate these volumes, techniques from integral geometry [San04] and geometric measure theory [Fed69] are often used, as well as the volume of tube formulas of Hermann Weyl [Wey39].

This approach to estimating statistical properties of condition was pioneered by Smale [Sma81]. Mike Shub and I used these techniques to get log linear estimates for the average loss of precision in evaluating rational functions [BS86]. It is the approach employed today to get complexity estimates in numerical analysis; see, e.g., [BCL08].
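A crude Monte Carlo sketch (mine) of the kind of tail estimate being described, for the matrix condition number of Gaussian random matrices; the dimension, sample size, and thresholds are arbitrary illustrative choices.

```python
# Estimate Prob[kappa(A) >= t] for random Gaussian n x n matrices, i.e., the
# chance of landing in a thin "tube" around the variety of singular matrices.
import numpy as np

rng = np.random.default_rng(2)
n, trials = 10, 20000
kappas = np.array([np.linalg.cond(rng.standard_normal((n, n))) for _ in range(trials)])

for t in (10, 100, 1000, 10000):
    print(f"Prob[kappa >= {t:>5}] ~ {np.mean(kappas >= t):.4f}")
# For fixed n, the tail decays roughly like 1/t, consistent with condition
# being inversely proportional to distance from the singular matrices.
```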

Many have observed that average analysis of algorithms may not necessarily reflect their typical behavior. In 2001, Spielman and Teng introduced the concept of smoothed analysis which, interpolating between worst and average case, suggests a more realistic scheme [ST01].

The idea is to first smooth the complexity measure locally. That is, rather than focus on running time at a problem instance, compute the average running time over all slightly perturbed instances. Then globally compute the worst case over all the local “smoothed” expectations.

More formally, assuming a normal distribution on perturbations, smoothed running time (or cost) is defined as

T_s(n, ε) = sup_{x̄ ∈ R^n} E_{x ∼ N(x̄, σ²)} T(x, ε).

Here N(x̄, σ²) designates the normal distribution centered at x̄ with variance σ², and x ∼ N(x̄, σ²) means x is chosen according to this distribution. If σ = 0, smoothed cost reduces to worst case cost; if σ = ∞ then we get the average cost.
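Here is a toy sketch (my own) of the effect of the definition, using log10 of the matrix condition number as a stand-in for cost and comparing a worst-case (singular) center with a random one; it only compares two centers rather than taking the supremum over all of them.

```python
# Average a cost surrogate (log10 of the condition number) over Gaussian
# perturbations of a center matrix, for several perturbation sizes sigma.
import numpy as np

rng = np.random.default_rng(3)

def smoothed_cost(center, sigma, trials=2000):
    """Monte Carlo estimate of E_{A ~ N(center, sigma^2)} log10 kappa(A)."""
    costs = [np.log10(np.linalg.cond(center + sigma * rng.standard_normal(center.shape)))
             for _ in range(trials)]
    return np.mean(costs)

singular = np.ones((4, 4))              # rank one: the cost at the center itself blows up
random_center = rng.standard_normal((4, 4))

for sigma in (1e-1, 1e-2, 1e-3):
    print(sigma, smoothed_cost(singular, sigma), smoothed_cost(random_center, sigma))
# Even at the worst-case (singular) center, the smoothed cost grows only like
# log(1/sigma) rather than blowing up, which is the point of the definition.
```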

Part 2 of Smale’s scheme is now replaced by:

2*. Estimate sup_{x̄ ∈ R^n} Prob_{x ∼ N(x̄, σ²)} {x | μ(x) ≥ t}.

Now, 1. and 2*. combine to give a smoothed complexity analysis eliminating μ. Estimating 2* employs techniques described above, now intersecting tubes about Σ with discs about x̄ to get local probability estimates.

Dunagan, Spielman and Teng [DSTT02] also give a smoothed analysis of Renegar’s condition number. Assuming σ ≤ 1/√m, the smoothed value of log C(A, b, c) is O(log m/σ). This in turn yields smoothed complexity analyses of Renegar’s linear programming algorithm. For the number of iterates:

sup_{∥(Ā, b̄, c̄)∥ ≤ 1} E_{(A,b,c) ∼ N((Ā, b̄, c̄), σ²I)} #((A, b, c), ε) = O(√m log(m/ε)).

And for the smoothed arithmetic cost:

T_s(m, ε) = O(m³ log(m/ε)).

9 What does all this have to do with the classical P vs. NP challenge?

This is (essentially) the question asked by Dick Karp in the Preface to our book, Com- plexity and Real Computation [BCSS98].

As noted in Section 5, the problem of deciding the solvability of finite polynomial systems, HN, is a universal NP-complete problem. Deciding quadratic (degree 2) polynomial systems is also universal [BSS89]. If solvable in polynomial time over the complex numbers C, then any classical NP problem is decidable in probabilistic polynomial time in the bit model [CSS94]. While premise and conclusion seem unlikely, understanding the complexity of HN_C is an important problem in its own right. Much progress has been made here, with condition playing an essential role.

During the 1990’s, in a series of papers dubbed “Bezout I-V,” Shub and Smale showed that the problem of finding approximate zeros to “square” complex polynomial systems can be solved probabilistically in polynomial time on average over C [SS94, Sma97].

The notion of an approximate zero means that Newton’s method converges quadratically, immediately, to an actual zero. Achieving output accuracy to within ε requires only log log 1/ε additional steps.
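A quick sketch (mine, with an arbitrary cube-root example) of the quadratic convergence underlying this notion: from a good start the number of correct digits roughly doubles per step, so reaching accuracy ε takes on the order of log log(1/ε) further steps.

```python
# Newton's method from an "approximate zero": the error roughly squares each step.
import numpy as np

f = lambda z: z**3 - 2.0          # find the real cube root of 2
df = lambda z: 3.0 * z**2
root = 2.0 ** (1.0 / 3.0)

z = 1.3                           # close enough to behave as an approximate zero
for step in range(6):
    z = z - f(z) / df(z)          # Newton iteration
    print(step + 1, abs(z - root))
# The errors fall roughly 1e-3, 1e-6, 1e-12, then machine precision:
# the number of correct digits doubles at each step.
```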

By the implicit function theorem, this lift exists if the line does not intersect the discriminant variety of polynomial systems that have zeros with multiplicities. These algorithms approximate the lifting.

To steer clear of the singular variety of ill-posed pairs (i.e., pairs (f, ζ) where ζ is a multiple zero of f), they take into account the condition along the way. The condition will determine the appropriate step size at each stage and hence the running time.

Major considerations are: how to choose good initial pairs, how to construct good partitions (for approximating the path and steering clear of the singular variety), how to define measures of condition.

In two additional papers, Bezout VI [Shu09] and Bezout VII [BS09], Shub and Carlos Beltrán present an Adaptive Linear Homotopy (ALH) algorithm with incremental time step dependent on the inverse of a normalized condition number squared.
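The sketch below is a stripped-down, univariate caricature of adaptive path following (my own invention, not the ALH algorithm itself): it follows a zero of the segment from an easy start polynomial to a target, shrinking the time step when a crude local condition estimate grows. The polynomials and the step-size rule are made up for illustration.

```python
# Follow a zero of f_t = (1 - t) g + t f from t = 0 to t = 1, with Newton
# corrections and a step size that shrinks when the derivative is small.
import numpy as np

f = lambda z, t: (1 - t) * (z**2 - 1) + t * (z**2 - 3*z + 1)   # g at t=0, target at t=1
df = lambda z, t: (1 - t) * (2*z) + t * (2*z - 3)

z, t = 1.0, 0.0                   # start at a known zero of g(z) = z^2 - 1
while t < 1.0:
    cond_est = 1.0 / max(abs(df(z, t)), 1e-12)        # crude local condition estimate
    t = min(t + 0.2 / (1.0 + 10.0 * cond_est), 1.0)   # smaller steps when ill-conditioned
    for _ in range(3):                                # a few Newton corrections at the new t
        z = z - f(z, t) / df(z, t)

print(z)   # tracks the branch from z = 1 to the root (3 + sqrt(5))/2 ~ 2.618
```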

Beltrán and Luis Miguel Pardo [BP11] show how to compute a random starting pair yielding a uniform Las Vegas algorithm, polynomial time on average. The randomized algorithm was implemented by Beltrán and Anton Leykin [BL12] utilizing the numerical algebraic geometry package Macaulay2.

Bürgisser and Cucker give a hybrid deterministic algorithm which is almost polynomial time on average [BC11]. Let D be the maximum of the degrees d_i. Then for D ≤ n the algorithm is essentially the ALH of Beltrán and Shub with initial pair:

g = (g_1, ..., g_n), where g_i(x_0, ..., x_n) = (1/√(2n)) (x_0^{d_i} − x_i^{d_i}), and ζ = (1, ..., 1).

And for D > n, the algorithm calls on Renegar’s symbolic algorithm [Ren89].
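A tiny check (mine; the degrees are made up) that the initial pair above really is a pair, i.e., that ζ = (1, ..., 1) is a zero of the start system g:

```python
# With zeta = (1, ..., 1), every g_i(zeta) = (x_0^{d_i} - x_i^{d_i}) / sqrt(2n) vanishes,
# so the homotopy can start at a known (system, zero) pair.
import numpy as np

def g(x, degrees):
    n = len(degrees)
    x0, rest = x[0], x[1:]
    return (x0 ** degrees - rest ** degrees) / np.sqrt(2 * n)

degrees = np.array([2, 3, 5])             # illustrative degrees d_1, ..., d_n
zeta = np.ones(len(degrees) + 1)          # homogeneous coordinates (1, ..., 1)
print(g(zeta, degrees))                   # [0. 0. 0.]
```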

The algorithm takes N^{log log N} arithmetic steps on average, coming close to answering Smale’s question in the affirmative.

For a tutorial on the subject of this section, see [BS12].

10 Postscript: Who Invented the Condition Number?

It is clear that Alan Turing was first to explicitly formalize a measure that would capture the informal notion of condition (of solving a linear system) and to call this measure a condition number. Formalizing what’s “in the air” serves to illuminate essence and chart new direction. However, ideas in the air have many proprietors.

To find out more about the origins of the spectral condition number, I emailed a number of numerical analysts.

I also looked at many original papers. The responses I received, and related readings, uncover a debate concerning the origins of the (concept of) condition number not unlike the debate surrounding the origins of the general purpose computer —with Turing and von Neumann figuring centrally in both. (For an insightful assessment of the latter debate see Mike Shub’s article, “Mysteries of Mathematics and Computation” [Shu94].)

Gauss himself [Gau03] is referenced for considering perturbations and preconditioning. Pete Stewart points to Helmut Wittmeyer [Wit36] in 1936 for some of the earliest perturbation bounds where products of norms appear explicitly. In 1949, John Todd [Tod50] explicitly focused on the notion of condition number, citing Turing’s N and M condition numbers and the implicit von Neumann-Goldstine measure, which he called the P-condition number (P for Princeton).

Beresford Parlett tells me that “the notion was ‘in the air’ from the time of Turing and von Neumann et al.,” that the concept was used by George Forsythe in a course he took from him at Stanford early in 1959, and that Wilkinson most surely “used the concept routinely in his lectures in Ann Arbor [summer, 1958].” The earliest explicit definition of the spectral condition number I could find in writing was in Alston Householder’s 1958 SIAM article [Hou58] (where he cites Turing) and then in Wilkinson’s book [Wil63], p. 91.

By far the most informative and researched history can be found in Joe Grcar’s 76-page article, “John von Neumann’s Analysis of Gaussian Elimination and the Origins of Modern Numerical Analysis” [Grc11]. Here he uncovers a letter from von Neumann to Goldstine (dated January 11, 1947) that explicitly names the ratio of the extreme singular values as ℓ. Why this was not included in their paper [vNG47] or made explicit