CSE 310 — Data Structures and Algorithms

Project #3

In recent years there has been a great deal of interest in “small-world” graphs [5]. These types of graphs

are characterized by a high degree of local clustering and a small number of long-range connections, making

them very efficient at transferring information. This “small-world” architecture underlies the well-known

“six degrees of separation” phenomenon, in which it is believed that a connection can be found between any

two people on the planet requiring no more than six intermediate links.

In this project, we represent climatological data as a graph and compute some characteristics of it to

determine if it is a small-world graph.

Note: This project is to be completed individually. Your implementation must use C/C++ and ultimately

your code must run on the Linux machine general.asu.edu.

Any dynamic memory allocation must be done yourself, using either malloc() and free(), or new()

and delete(). You may not use any external libraries to implement any part of this project, aside from the

standard libraries for I/O, file I/O, and string functions. If you are in doubt about what you may use, ask!

You must use a version control system as you develop your solution to this project, e.g., GitHub or

similar. Your code repository must be private to prevent anyone from plagiarizing your work.

The rest of this project description is organized as follows. §1 describes the sea ice data set and how to

represent it as a graph. §2 describes the types of analyses to be performed on such graphs. §3 describes the

submission requirements for the milestone and the full project deadlines.

1 Arctic Sea Ice

Sea ice covers most of the Arctic Ocean and plays a significant role in the global water cycle and the global

energy balance. It is also considered to be a sensitive indicator of climate change. Thus, any changes in the

Earth’s climate are likely to first be seen in areas such as the high Arctic.

Since the 1970s, the areal extent of sea ice has been shrinking. In September of 2007, the mean sea

ice extent was 1.65 million square miles, which was the lowest ever recorded for the month of September,

shattering the previous record in 2005 by 23%. Current climate model projections indicate that the Arctic

could be seasonally ice-free by 2050–2100, which will significantly impact the global climate [2].

Because of its importance as a proxy indicator of climate change, a great deal of research is conducted

on Arctic sea ice. Data acquired by meteorological satellites provides one of the most effective ways to

study large-scale changes in sea ice conditions in the Arctic. The longest continuous satellite record of

sea ice comes from the Nimbus-7 Scanning Multi-channel Microwave Radiometer (SMMR) and Defense

Meteorological Satellite Program Special Sensor Microwave/Imager (DMSP SSM/I) series of meteorological

satellites. Data acquisition started in late 1978, with the first full year of data in 1979, and is ongoing. This

data is maintained for the Arctic and Antarctic by the National Snow & Ice Data Center (NSIDC).

1.1 Sea Ice Concentration Anomaly Data Set

The sea ice concentration (SIC) anomaly data set that we will use consists of 27 years (1979–2005) of weekly

SIC anomaly data derived from the SMMR-SSM/I passive microwave data set. An anomaly data set is

when the long-term average is subtracted from the data, to remove seasonal trends, making the data more

amenable to statistical analysis.

1

The data for each week is for a 304 ×448 floating point array representing the northern hemisphere. The

data value at each cell (x, y) in the array represents the percentage of deviation in ice concentration from the

27-year average for a given week. For example looking at the values of the array for week 30 of 1990, at cell

(100, 200), the value is -4.5. This means that at cell (100, 200) for week 30 of 1990, the sea ice concentration

was 4.5% lower than the 27-year average value for week 30 for that cell.

Since there are 52 weeks per year and 27 years of data, there are 52 × 27 = 1, 404 sea ice concentration

readings for each position (x, y) over the years. Therefore, for each of the 304 × 448 = 136, 192 positions

there is a time series [x, y, t], 1 ≤ t ≤ 1, 404, of SIC data with 1,404 values, starting at week 1 of 1979.

The data set is given as 1,404 files each containing a 304 × 448 32-bit floating point array (little-endian

byte order). The format of the filenames is: diffwNNyYYYY+landmask, where wNN denotes week NN and yYYYY

denotes the year. For example, diffw31y1983+landmask is the file for week 31 of 1983. The “+landmask”

part of the name indicates that a landmask was applied to the data.

Since we are dealing with sea ice, land masses can be ignored; these constitute approximately half of the

cells in each of the arrays. Land is denoted by the value 168. Missing data is denoted by the value of 157.

Figure 1(a) is a sample SSM/I sea ice concentration image, which has been pseudo-coloured to make it easier

to view. Each pixel corresponds to a nominal physical area of 25 square kilometers. There is a large circular

disk over the North Pole, an area of missing data due to the satellite’s orbit. The satellite orbits from pole

to pole (i.e., longitudinally), but at an incline, so there is a circular area that is not covered. Hence, the only

missing data is in the circular region over the North Pole.

(a) Full Arctic region (b) Beaufort Sea subregion

Figure 1: Sample SSM/I total sea ice concentration image.

Figure 1(b) shows a subregion near the Beaufort Sea (Ellesmere Island is on the lower right.) The

corresponding data set has only 63 × 63 = 3969 cells for each of the 16 years (1990–2005). This smaller data

set is otherwise identical to the full data set and should be used for code development and testing.

1.2 A Graph Representation for the Climatological Data Set

How is this type of climatological data represented as a graph? Tsonis et al. derived a correlation-based

graph G = (V, E) [1, 4]. The vertex set V corresponds to the cells, i.e., for the full SIC data set the graph has

304 × 448 vertices. To determine the edge set, the Pearson correlation coefficient (see §1.2.1) is calculated

between all pairs of cells (x, y) and (x

0

, y0

), 1 ≤ x, x0 ≤ 304, 1 ≤ y, y0 ≤ 448, (x, y) 6= (x

0

, y0

), of time series

vectors. That is, the correlation coefficient is computed between [x, y, t] and [x

0

, y0

, t], 1 ≤ t ≤ 1, 404, for

each pair of distinct cells.

Given that there are n = 136, 192 cells and there are n(n−1)/2 pairs of cells, the correlation coefficient is

calculated for 3,547,116 pairs of time series. If the correlation coefficient for a pair of cells (x, y) and (x

0

, y0

)

of time series, i.e., [x, y, t] and [x

0

, y0

, t], 1 ≤ t ≤ 1, 404, is greater than some threshold rthresh, then an edge

2

is inserted between cells (x, y) and (x

0

, y0

). The final result is a graph with edges between all cells having a

correlation greater than the threshold rthresh.

1.2.1 Pearson Correlation Coefficient

To get a measure of how strongly two vectors X and Y , of length n, are related, we use the correlation

coefficient. The Pearson correlation coefficient measures the strength and direction of a linear relationship

between X and Y . The formula for the sample correlation coefficient, denoted by r is:

r = p

Sxy

SxxSyy

where,

Sxx =

Xn

i=1

(Xi − X)

2

, Syy =

Xn

i=1

(Yi − Y )

2

, and Sxy =

Xn

i=1

(Xi − X)(Yi − Y ).

In our case the vector X corresponds to the time series of length 1,404 for cell (x, y), whereas the vector

Y corresponds to the time series of the same length for cell (x

0

, y0

). You will need to compute the sample

correlation coefficient between all pairs of distinct cells (x, y) and (x

0

, y0

). For a given pairs of cells (x, y)

and (x

0

, y0

), if |r| ≥ rthresh then you should insert an edge in the graph between the vertex corresponding to

(x, y) and the vertex corresponding to (x

0

, y0

). (The absolute value |r| of the correlation coefficient should

be used, since r can be positive or negative.)

2 Analyses of the Climatological Graph

We are interested in whether the graph derived from the SIC climatological data is a small-world graph.

First, construct a correlation-based graph Gr = (Vr, Er) for the sea ice anomaly dataset for each correlation

threshold rthresh ∈ {0.95, 0.925, 0.9}. Use an adjacency list of represent the graph. (Start with the largest

threshold as it will yield the sparsest graph.)

Now, for each correlation-based graph Gr:

1. Plot the histogram of the degree distribution of Gr. If |Vr| = n vertices, the degree of a vertex can

range from zero (the vertex is isolated) to n − 1 (the vertex has an edge to every other vertex in the

graph). A histogram of the degree distribution for Gr therefore plots a count of the number of vertices

of each degree d, 0 ≤ d ≤ n − 1.

Recall that a small-world graph is characterized by a high degree of local clustering and a small number

of long-range connections. As a result, we expect degree distribution of a small-world graph to have a

long-tailed distribution.

2. Compute the number of connected components in Gr and their size (i.e., number of vertices in each

component). Gr with n vertices may consist of from one to n connected components. A connected

component, C = (VC , EC ), is a subgraph of Gr such that VC ⊆ Vr and EC ⊆ Er. For any pair of

vertices u, v ∈ VC , there exists a path from u to v consisting of a finite number of edges in EC . Similarly,

every edge in EC has endpoints in the vertices of C. Therefore, every vertex within a component is

reachable from any other vertex in that component, but there are no paths between components. The

discovery of connected components is implemented in this project using a recursive depth-first traversal

of Gr.

For a small-world graph, what do you think the component structure should look like?

3. Another way to determine if the graph Gr is a small-world graph is to calculate the mean clustering

coefficient γ(G) and the characteristic path length (or diameter), L(G), of Gr and compare it to a

3

random graph Grandom of the same size (i.e., with the same number of vertices). (These characteristics

are defined in §2.1 and in §2.2, respectively.) For a small-world graph, γ(Gr) γ(Grandom) and

L(Gr) ≥ L(Grandom) [3, 5].

Compute the clustering coefficient, γ(Gr), and the characteristic path length, L(Gr), of the graph Gr.

4. Compare γ(Gr) and L(Gr) to γ(Grandom) and L(Grandom) for the random graph, Grandom, of comparable size. (See §2.3 for details.)

2.1 Clustering Coefficient

The neighbourhood N(v) of a vertex v consists of all the vertices adjacent to v. The graph generated by

N(v), hN(v)i has vertex set N(v) and its edges are all edges of the graph with both endpoints in N(v). Use

k(v) and e(v) to denote the numbers of vertices and edges in hN(v)i. The clustering coefficient γv of v is:

γv =

e(v)

k(v)

2

=

2e(v)

k(v)(k(v) − 1).

In words, γv for a vertex v is the proportion of edges between the vertices within its neighbourhood

divided by the number of edges that could possibly exist between them.

The clustering coefficient of a graph G is the mean of the clustering coefficient of all vertices of G and is

denoted γ(G).

2.2 Characteristic Path Length

Let di,j be the length of the shortest path between the vertices i and j. Then the characteristic path length

L(G) for the graph G = (V, E), is di,j averaged over all n

2

pairs of vertices, where n = |V |. Use the

Floyd-Warshall all-pairs shortest paths algorithm.

2.3 Metrics for Random Graphs

A random graph Grandom corresponding to Gr has the same number of vertices as Gr, namely n = |Vr|.

However, edges are inserted into Grandom at random such that the edge density of Grandom matches the

edge density of Gr, i.e., the probability p that edges are present in Grandom should match that of Gr. This

edge probability p is calculated from Gr having n vertices and mean vertex degree k as:

p =

nk

2

n

2

=

nk

2

n(n−1)

2

=

k

n − 1

.

This calculation is derived intuitively by reasoning that the fraction of actual edges nk

2

out of the total

number of possible edges n

2

should approximate the edge probability of the graph.

The clustering coefficient of a random graph Grandom is γ(Grandom) = k

n

. Similarly, the characteristic

path length of a random graph Grandom is L(Grandom) = log n

log k

. Here, n is the total number of vertices in

the correlation graph Gr, and k is the mean vertex degree of Gr.

Given the degree distribution of Gr it is straightforward to compute k.

3 Submission Instructions

Submissions are always due before 11:59pm on the deadline date.

1. For SLN 84794 (MW) the milestone is due on Wednesday, 11/20/2019. For SLN 83657 (TTh) the

milestone is due on Thursday, 11/21/2019. See §3.1 for requirements

2. For SLN 84794 (MW) the complete project is due on Wednesday, 12/04/2019. For SLN 83657 (TTh)

the complete project is due on Thursday, 12/05/2019. See §3.2 for requirements.

It is your responsibility to submit your project well before the time deadline!!! Late projects

are not accepted. Do not expect the clock on your machine to be synchronized with the one on Canvas!

An unlimited number of submissions are allowed. The last submission will be graded.

3.1 Requirements for Milestone Deadline

By the milestone deadline, your project must compute the degree distribution and number and size of connected components in a correlation based graph derived from the “small” Beaufort Sea SIC data set.

Using the submission link on Canvas for the Project #3 milestone, a zip1 file named yourFirstName-yourLastName.zip

containing the following:

Project State (5%): In a folder named State provide a brief report that addresses the following:

1. Explain all design decisions. Discuss your representation of the graph, and any optimizations you

made in computations. (Depending on the order in which you calculate the statistics, you can

likely save and make use of previous results to cut down on the computation time.) Can you

compute a worst-case bound on the time and/or space of your algorithms?

2. Describe any problems encountered in your implementation for this project milestone.

3. Describe any known bugs in your project milestone.

4. While this project is to be completed individually, describe any significant interactions with anyone

(peers or otherwise) that may have occurred.

5. Cite any external code bases, books, and/or websites used or referenced.

Implementation (50%): In a folder named Code provide:

1. In one or more files, your well documented C/C++ source code implementing this project milestone.

2. A makefile that compiles your program to an executable named seaice on the Linux machine

general.asu.edu. Our TA will write a script to compile and run all student submissions on

general.asu.edu; therefore executing the command make seaice in the Code directory must

produce the executable seaice also located in the Code directory.

Report (20%): In a folder named Report provide a report containing the following:

1. For the “small” data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold

rthresh ∈ {0.95, 0.925, 0.9}, plot the degree distribution.

(a) Do you think the degree distribution is consistent with that of a small-world graph? Why or

why not?

(b) Identify any supernodes, i.e., vertices with significantly higher vertex degree than the average,

and where they occur. Describe your determination of supernode.

2. For the “small” data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold

rthresh ∈ {0.95, 0.925, 0.9}, compute the number of connected components in Gr and their size

(i.e., number of vertices).

(a) For a small-world graph, how do you think the component structure should look?

(b) Do your results support your hypothesis?

1Do not use any other archiving program except zip.

5

Correctness (25%): The correctness of your program will be determined by running your program on the

“small” data set for a given threshold.

The milestone is worth 30% of the total project grade.

3.2 Requirements for Complete Project Deadline

For the full project deadline, your project must perform all the analyses described in §2 on correlation based

graphs derived from the “small” Beaufort Sea SIC data set. If you can run the analyses on the full data set,

a bonus of up to 10% is available.

Using the submission link on Canvas for the complete Project #3, a zip2 file named yourFirstName-yourLastName.zip

containing the following:

Project State (5%): Follow the same instructions for Project State as in §3.1.

Implementation (40%): Follow the same instructions for Implementation as in §3.1.

Report (25%): In addition to the analyses run for the Report for the milestone deadline (§3.1), you must

also run the following analyses:

1. For the “small” data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold

rthresh ∈ {0.95, 0.925, 0.9}, compute the clustering coefficient, γ(Gr), and the characteristic path

length, L(Gr), of the graph Gr.

2. Compute the clustering coefficient, γ(Grandom), and the characteristic path length, L(Grandom),

of a random graph Grandom corresponding to each Gr.

(a) Compare γ(Gr) and L(Gr) to γ(Grandom) and L(Grandom) for the random graph, Grandom,

of comparable size.

(b) What conclusion(s) can you draw from your results?

Correctness (30%): The same instructions for Correctness as in §3.1 apply.

Bonus (10%): Repeat the analyses on the full data set. (I know that this is a lot of extra work for a bonus;

it is a good indicator of the scalability of your code.)

Acknowledgements

Thanks to my former student Wayne S. Chan who motivated this project.

References

[1] J. P. Onnela, K. Kaski, and J. Kert´esz. Clustering and information in correlation based financial networks.

European Physical Journal B, 38(2):353–362, March 2004.

[2] J. T. Overpeck, M. Sturm, J. A. Francis, D. K. Perovich, M. C. Serreze, R. Benner, E. C. Carmack, F. S. Chapin

III, S. C. Gerlach, L. C. Hamilton, L. D. Hinzman, M. Holland, H. P. Huntington, J. R. Key, A. H. Lloyd, G. M.

MacDonald, J. McFadden, D. Noone, T. D. Prowse, P. Schlosser, and C. Vorosmarty. Arctic system on trajectory

to new, seasonally ice-free state. Earth Observation Science, 86(34):309–316, August 2005.

[3] A. A. Tsonis, K. L. Swanson, and P. J. Roebber. What do networks have to do with climate? Bulletin of the

American Meteorological Society, pages 585–595, May 2006.

[4] M. Tumminello, D. Matteo, T. Aste, and R. N. Mantegna. Correlation based networks of equity returns sampled

at different time horizons. European Physical Journal B, 55:209–217, 2007.

[5] D. J. Watts and S. H. Strogatz. Collective dynamics of “small-world” networks. Nature, 393:440–442, 1998.

2Do not use any other archiving program except zip.

6