## Description

Math 753/853 HW2: Mathematics in finite precision

Before you begin!¶

First rename this file “math753-hw2-mylastname” using your last name in the indicated spot. With the file extension, the full filename should be “math753-hw2-mylastname.ipynb”. No capital letters, please.

To submit the homework email it to [email protected] with the subject “MATH 753 HW2” by midnight, Sunday Oct 1, 2017.

Writing explanations in your solutions. Many of the problems ask for explanations of answers and calculations. For these you should write one to several complete sentences that explain the reasoning behind your work in a manner that would be helpful to a fellow student.

Problem 0: example with solution

Use the Conditioning and Accuracy Theorem to estimate a value for the expected relative error |f~−f|/|f| for the floating-point calculation f~(x) of the mathematical problem f(x) in 64-bit arithmetic. Then devise a calculation in Julia that confirms your expectations. Explain each of your answers.

Subtraction, very different numbers. Estimate a value for the relative error for floating-point calculation of subtraction, f(x1,x2)=x2−x1 , using x1=17/13×1023 and x2=19/57 . Confirm your estimate by calculating the relative error in Julia. Explain each answer.

# The condition number of positive floating-point subtraction f(x₁, x₂) = x₂ – x₁ is

# κ = max(x₁, x₂)/|x₂ – x₁|

#

# for x₁ = 17/13 × 10²³ and x₂ = 19/57, this works out to

#

# κ(x) = 17/13 × 10²³ / |17/13 × 10²³ – 19/57| ≈ 1

#

# The conditioning and accuracy theorem then estimates the relative error

# of the 64-bit calculation to be

# |f̃ – f|/|f| = O(κ(x) ϵₘ) ≈ 2e-16

# Let’s verify that with a direct calculation in Julia. Use BigFloats to represent the

# “real” numbers in high precision and compute the “true” value f, and then calculate

# the floating-point value f̃ in Float64.

x₁ = BigFloat(17//13) * BigFloat(10)^23

x₂ = BigFloat(19//57)

f = x₂ – x₁

f̃ = Float64(x₂) – Float64(x₁)

abs(f̃ – f)/abs(f)

3.454132960784313725490204883084017685505574778953623547496062039450477003140082e-17

The relative error is about 3e-17, which slightly smaller than the estimate with machine precision 2e-16.

Problem 1

What is the next Float64 number after 6.0? Determine the precise value based on the structure of the 64-bit floating point number system as presented in lecture. Explain your reasoning. Then confirm your answer with a few calculations in Julia, and explain how the calculations confirm your expectations.

Problem 2

What range of integers can be represented exactly in the Float64 number system? I.e. what is the maximum value of N for which all integers between −N and N are represented exactly as Float64s? As before, determine the answer based on the structure of the 64-bit floating-point number system as presented in lecture, and explain your reasoning. Then confirm your answer with a few calculations in Julia, and explain how the calculations confirm your expectations.

Problem 3

There is a debate underway in the Julia development community whether the time library should represent time with floating-point numbers or with integers. (https://discourse.julialang.org/t/why-do-time-quantities-have-to-be-integers/5864/51) Suppose you want to measure time with millisecond accuracy near the present, but also dates as far back as 0 BC, using just a single number. What kind of number should you use, integer or floating-point? How many bits would you need? Explain your reasoning.

Problems 4-10

For each of problems 4 through 10, use the Conditioning and Accuracy Theorem to estimate a value for the expected relative error |f~−f|/|f| for the floating-point calculation f~(x) of the mathematical problem f(x) in 64-bit arithmetic. Then devise a calculation in Julia that confirms your expectations. Explain each of your answers.

Problem 4

Multiplication, order-1 numbers. Estimate a value for the relative error for floating-point calculation of multiplication f(x)=cx , with c=7/11 and x=19/3 (numbers both close to 1). Confirm your estimate by calculating the relative error in Julia. Explain each answer.

Problem 5

Multiplication, large numbers. Estimate a value for the relative error for floating-point calculation of multiplication, f(x)=cx , using c=2/3×1072 and x=5/7×10200 (two very large numbers). Confirm your estimate by calculating the relative error in Julia. Explain each answer.

Problem 6

Addition, order-1 numbers. Estimate a value for the relative error for floating-point calculation of addition, f(x1,x2)=x1+x2 , using x1=17/3 and x2=5/11 (two numbers near 1). Confirm your estimate by calculating the relative error in Julia. Explain each answer.

Problem 7

Subtraction, very close numbers. Estimate a value for the relative error for floating-point calculation of subtraction, f(x1,x2)=x2−x1 , using x1=129/11 and x2=x1+δ , where δ=10−13 . Confirm your estimate by calculating the relative error in Julia. Explain each answer.

Problem 8

Powers, big x , small n . Estimate a value for the relative error for floating-point calculation of the power function f(x)=xn for x=11/7×1013 and n=12 . Confirm your estimate by calculating the relative error in Julia. Explain each answer.

Problem 9

Powers, small n , big x . Estimate a value for the relative error for floating-point calculation of the power function f(x)=xn for x=12 and n=11/7×1013 . Compare your estimate to the relative error calculated in Julia. Explain each answer.

Problem 10

Solution of linear system. Estimate a value for the relative error for numerical solution of the linear system Ax=b for the given A and b . Compare your estimate to a direct calculation of the relative error in Julia. Explain.

Note that in this case the data is A,b , the solution is x , the mathematical function is is f(A,b)=A−1b=x , and the relative error is |x~−x∥/∥x∥ . The condition number of the Ax=b solve is κ(A)=∥A∥∥A−1∥ , which can be calculated in Julia with cond(A).

function randA(m, κ)

σ = logspace(0, -log10(κ), m)

Σ = diagm(σ)

U,tmp = qr(randn(m,m))

V,tmp = qr(randn(m,m))

A = U*Σ*V’

end

A = randA(5, 1e15); # generate a random matrix A with condition number 1e15

x = randn(5); # generate a random vector x, the true solution of Ax=b

b = A*x; # determine the right-hand-side vector b s.t. Ax=b

randA (generic function with 1 method)