CSE 310 — Data Structures and Algorithms Project #3 solved

$28.00

Category:

Description

In recent years there has been a great deal of interest in “small-world” graphs [5]. These types of graphs
are characterized by a high degree of local clustering and a small number of long-range connections, making
them very efficient at transferring information. This “small-world” architecture underlies the well-known
“six degrees of separation” phenomenon, in which it is believed that a connection can be found between any
two people on the planet requiring no more than six intermediate links.
In this project, we represent climatological data as a graph and compute some characteristics of it to
determine if it is a small-world graph.
Note: This project is to be completed individually. Your implementation must use C/C++ and ultimately
your code must run on the Linux machine general.asu.edu.
Any dynamic memory allocation must be done yourself, using either malloc() and free(), or new()
and delete(). You may not use any external libraries to implement any part of this project, aside from the
standard libraries for I/O, file I/O, and string functions. If you are in doubt about what you may use, ask!
You must use a version control system as you develop your solution to this project, e.g., GitHub or
similar. Your code repository must be private to prevent anyone from plagiarizing your work.
The rest of this project description is organized as follows. §1 describes the sea ice data set and how to
represent it as a graph. §2 describes the types of analyses to be performed on such graphs. §3 describes the
submission requirements for the milestone and the full project deadlines.
1 Arctic Sea Ice
Sea ice covers most of the Arctic Ocean and plays a significant role in the global water cycle and the global
energy balance. It is also considered to be a sensitive indicator of climate change. Thus, any changes in the
Earth’s climate are likely to first be seen in areas such as the high Arctic.
Since the 1970s, the areal extent of sea ice has been shrinking. In September of 2007, the mean sea
ice extent was 1.65 million square miles, which was the lowest ever recorded for the month of September,
shattering the previous record in 2005 by 23%. Current climate model projections indicate that the Arctic
could be seasonally ice-free by 2050–2100, which will significantly impact the global climate [2].
Because of its importance as a proxy indicator of climate change, a great deal of research is conducted
on Arctic sea ice. Data acquired by meteorological satellites provides one of the most effective ways to
study large-scale changes in sea ice conditions in the Arctic. The longest continuous satellite record of
sea ice comes from the Nimbus-7 Scanning Multi-channel Microwave Radiometer (SMMR) and Defense
Meteorological Satellite Program Special Sensor Microwave/Imager (DMSP SSM/I) series of meteorological
satellites. Data acquisition started in late 1978, with the first full year of data in 1979, and is ongoing. This
data is maintained for the Arctic and Antarctic by the National Snow & Ice Data Center (NSIDC).
1.1 Sea Ice Concentration Anomaly Data Set
The sea ice concentration (SIC) anomaly data set that we will use consists of 27 years (1979–2005) of weekly
SIC anomaly data derived from the SMMR-SSM/I passive microwave data set. An anomaly data set is
when the long-term average is subtracted from the data, to remove seasonal trends, making the data more
amenable to statistical analysis.
1
The data for each week is for a 304 ×448 floating point array representing the northern hemisphere. The
data value at each cell (x, y) in the array represents the percentage of deviation in ice concentration from the
27-year average for a given week. For example looking at the values of the array for week 30 of 1990, at cell
(100, 200), the value is -4.5. This means that at cell (100, 200) for week 30 of 1990, the sea ice concentration
was 4.5% lower than the 27-year average value for week 30 for that cell.
Since there are 52 weeks per year and 27 years of data, there are 52 × 27 = 1, 404 sea ice concentration
readings for each position (x, y) over the years. Therefore, for each of the 304 × 448 = 136, 192 positions
there is a time series [x, y, t], 1 ≤ t ≤ 1, 404, of SIC data with 1,404 values, starting at week 1 of 1979.
The data set is given as 1,404 files each containing a 304 × 448 32-bit floating point array (little-endian
byte order). The format of the filenames is: diffwNNyYYYY+landmask, where wNN denotes week NN and yYYYY
denotes the year. For example, diffw31y1983+landmask is the file for week 31 of 1983. The “+landmask”
part of the name indicates that a landmask was applied to the data.
Since we are dealing with sea ice, land masses can be ignored; these constitute approximately half of the
cells in each of the arrays. Land is denoted by the value 168. Missing data is denoted by the value of 157.
Figure 1(a) is a sample SSM/I sea ice concentration image, which has been pseudo-coloured to make it easier
to view. Each pixel corresponds to a nominal physical area of 25 square kilometers. There is a large circular
disk over the North Pole, an area of missing data due to the satellite’s orbit. The satellite orbits from pole
to pole (i.e., longitudinally), but at an incline, so there is a circular area that is not covered. Hence, the only
missing data is in the circular region over the North Pole.
(a) Full Arctic region (b) Beaufort Sea subregion
Figure 1: Sample SSM/I total sea ice concentration image.
Figure 1(b) shows a subregion near the Beaufort Sea (Ellesmere Island is on the lower right.) The
corresponding data set has only 63 × 63 = 3969 cells for each of the 16 years (1990–2005). This smaller data
set is otherwise identical to the full data set and should be used for code development and testing.
1.2 A Graph Representation for the Climatological Data Set
How is this type of climatological data represented as a graph? Tsonis et al. derived a correlation-based
graph G = (V, E) [1, 4]. The vertex set V corresponds to the cells, i.e., for the full SIC data set the graph has
304 × 448 vertices. To determine the edge set, the Pearson correlation coefficient (see §1.2.1) is calculated
between all pairs of cells (x, y) and (x
0
, y0
), 1 ≤ x, x0 ≤ 304, 1 ≤ y, y0 ≤ 448, (x, y) 6= (x
0
, y0
), of time series
vectors. That is, the correlation coefficient is computed between [x, y, t] and [x
0
, y0
, t], 1 ≤ t ≤ 1, 404, for
each pair of distinct cells.
Given that there are n = 136, 192 cells and there are n(n−1)/2 pairs of cells, the correlation coefficient is
calculated for 3,547,116 pairs of time series. If the correlation coefficient for a pair of cells (x, y) and (x
0
, y0
)
of time series, i.e., [x, y, t] and [x
0
, y0
, t], 1 ≤ t ≤ 1, 404, is greater than some threshold rthresh, then an edge
2
is inserted between cells (x, y) and (x
0
, y0
). The final result is a graph with edges between all cells having a
correlation greater than the threshold rthresh.
1.2.1 Pearson Correlation Coefficient
To get a measure of how strongly two vectors X and Y , of length n, are related, we use the correlation
coefficient. The Pearson correlation coefficient measures the strength and direction of a linear relationship
between X and Y . The formula for the sample correlation coefficient, denoted by r is:
r = p
Sxy
SxxSyy
where,
Sxx =
Xn
i=1
(Xi − X)
2
, Syy =
Xn
i=1
(Yi − Y )
2
, and Sxy =
Xn
i=1
(Xi − X)(Yi − Y ).
In our case the vector X corresponds to the time series of length 1,404 for cell (x, y), whereas the vector
Y corresponds to the time series of the same length for cell (x
0
, y0
). You will need to compute the sample
correlation coefficient between all pairs of distinct cells (x, y) and (x
0
, y0
). For a given pairs of cells (x, y)
and (x
0
, y0
), if |r| ≥ rthresh then you should insert an edge in the graph between the vertex corresponding to
(x, y) and the vertex corresponding to (x
0
, y0
). (The absolute value |r| of the correlation coefficient should
be used, since r can be positive or negative.)
2 Analyses of the Climatological Graph
We are interested in whether the graph derived from the SIC climatological data is a small-world graph.
First, construct a correlation-based graph Gr = (Vr, Er) for the sea ice anomaly dataset for each correlation
threshold rthresh ∈ {0.95, 0.925, 0.9}. Use an adjacency list of represent the graph. (Start with the largest
threshold as it will yield the sparsest graph.)
Now, for each correlation-based graph Gr:
1. Plot the histogram of the degree distribution of Gr. If |Vr| = n vertices, the degree of a vertex can
range from zero (the vertex is isolated) to n − 1 (the vertex has an edge to every other vertex in the
graph). A histogram of the degree distribution for Gr therefore plots a count of the number of vertices
of each degree d, 0 ≤ d ≤ n − 1.
Recall that a small-world graph is characterized by a high degree of local clustering and a small number
of long-range connections. As a result, we expect degree distribution of a small-world graph to have a
long-tailed distribution.
2. Compute the number of connected components in Gr and their size (i.e., number of vertices in each
component). Gr with n vertices may consist of from one to n connected components. A connected
component, C = (VC , EC ), is a subgraph of Gr such that VC ⊆ Vr and EC ⊆ Er. For any pair of
vertices u, v ∈ VC , there exists a path from u to v consisting of a finite number of edges in EC . Similarly,
every edge in EC has endpoints in the vertices of C. Therefore, every vertex within a component is
reachable from any other vertex in that component, but there are no paths between components. The
discovery of connected components is implemented in this project using a recursive depth-first traversal
of Gr.
For a small-world graph, what do you think the component structure should look like?
3. Another way to determine if the graph Gr is a small-world graph is to calculate the mean clustering
coefficient γ(G) and the characteristic path length (or diameter), L(G), of Gr and compare it to a
3
random graph Grandom of the same size (i.e., with the same number of vertices). (These characteristics
are defined in §2.1 and in §2.2, respectively.) For a small-world graph, γ(Gr)  γ(Grandom) and
L(Gr) ≥ L(Grandom) [3, 5].
Compute the clustering coefficient, γ(Gr), and the characteristic path length, L(Gr), of the graph Gr.
4. Compare γ(Gr) and L(Gr) to γ(Grandom) and L(Grandom) for the random graph, Grandom, of comparable size. (See §2.3 for details.)
2.1 Clustering Coefficient
The neighbourhood N(v) of a vertex v consists of all the vertices adjacent to v. The graph generated by
N(v), hN(v)i has vertex set N(v) and its edges are all edges of the graph with both endpoints in N(v). Use
k(v) and e(v) to denote the numbers of vertices and edges in hN(v)i. The clustering coefficient γv of v is:
γv =
e(v)

k(v)
2
 =
2e(v)
k(v)(k(v) − 1).
In words, γv for a vertex v is the proportion of edges between the vertices within its neighbourhood
divided by the number of edges that could possibly exist between them.
The clustering coefficient of a graph G is the mean of the clustering coefficient of all vertices of G and is
denoted γ(G).
2.2 Characteristic Path Length
Let di,j be the length of the shortest path between the vertices i and j. Then the characteristic path length
L(G) for the graph G = (V, E), is di,j averaged over all n
2

pairs of vertices, where n = |V |. Use the
Floyd-Warshall all-pairs shortest paths algorithm.
2.3 Metrics for Random Graphs
A random graph Grandom corresponding to Gr has the same number of vertices as Gr, namely n = |Vr|.
However, edges are inserted into Grandom at random such that the edge density of Grandom matches the
edge density of Gr, i.e., the probability p that edges are present in Grandom should match that of Gr. This
edge probability p is calculated from Gr having n vertices and mean vertex degree k as:
p =
nk
2
n
2
 =
nk
2
n(n−1)
2
=
k
n − 1
.
This calculation is derived intuitively by reasoning that the fraction of actual edges nk
2
out of the total
number of possible edges n
2

should approximate the edge probability of the graph.
The clustering coefficient of a random graph Grandom is γ(Grandom) = k
n
. Similarly, the characteristic
path length of a random graph Grandom is L(Grandom) = log n
log k
. Here, n is the total number of vertices in
the correlation graph Gr, and k is the mean vertex degree of Gr.
Given the degree distribution of Gr it is straightforward to compute k.
3 Submission Instructions
Submissions are always due before 11:59pm on the deadline date.
1. For SLN 84794 (MW) the milestone is due on Wednesday, 11/20/2019. For SLN 83657 (TTh) the
milestone is due on Thursday, 11/21/2019. See §3.1 for requirements
2. For SLN 84794 (MW) the complete project is due on Wednesday, 12/04/2019. For SLN 83657 (TTh)
the complete project is due on Thursday, 12/05/2019. See §3.2 for requirements.
It is your responsibility to submit your project well before the time deadline!!! Late projects
are not accepted. Do not expect the clock on your machine to be synchronized with the one on Canvas!
An unlimited number of submissions are allowed. The last submission will be graded.
3.1 Requirements for Milestone Deadline
By the milestone deadline, your project must compute the degree distribution and number and size of connected components in a correlation based graph derived from the “small” Beaufort Sea SIC data set.
Using the submission link on Canvas for the Project #3 milestone, a zip1 file named yourFirstName-yourLastName.zip
containing the following:
Project State (5%): In a folder named State provide a brief report that addresses the following:
1. Explain all design decisions. Discuss your representation of the graph, and any optimizations you
made in computations. (Depending on the order in which you calculate the statistics, you can
likely save and make use of previous results to cut down on the computation time.) Can you
compute a worst-case bound on the time and/or space of your algorithms?
2. Describe any problems encountered in your implementation for this project milestone.
3. Describe any known bugs in your project milestone.
4. While this project is to be completed individually, describe any significant interactions with anyone
(peers or otherwise) that may have occurred.
5. Cite any external code bases, books, and/or websites used or referenced.
Implementation (50%): In a folder named Code provide:
1. In one or more files, your well documented C/C++ source code implementing this project milestone.
2. A makefile that compiles your program to an executable named seaice on the Linux machine
general.asu.edu. Our TA will write a script to compile and run all student submissions on
general.asu.edu; therefore executing the command make seaice in the Code directory must
produce the executable seaice also located in the Code directory.
Report (20%): In a folder named Report provide a report containing the following:
1. For the “small” data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold
rthresh ∈ {0.95, 0.925, 0.9}, plot the degree distribution.
(a) Do you think the degree distribution is consistent with that of a small-world graph? Why or
why not?
(b) Identify any supernodes, i.e., vertices with significantly higher vertex degree than the average,
and where they occur. Describe your determination of supernode.
2. For the “small” data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold
rthresh ∈ {0.95, 0.925, 0.9}, compute the number of connected components in Gr and their size
(i.e., number of vertices).
(a) For a small-world graph, how do you think the component structure should look?
(b) Do your results support your hypothesis?
1Do not use any other archiving program except zip.
5
Correctness (25%): The correctness of your program will be determined by running your program on the
“small” data set for a given threshold.
The milestone is worth 30% of the total project grade.
3.2 Requirements for Complete Project Deadline
For the full project deadline, your project must perform all the analyses described in §2 on correlation based
graphs derived from the “small” Beaufort Sea SIC data set. If you can run the analyses on the full data set,
a bonus of up to 10% is available.
Using the submission link on Canvas for the complete Project #3, a zip2 file named yourFirstName-yourLastName.zip
containing the following:
Project State (5%): Follow the same instructions for Project State as in §3.1.
Implementation (40%): Follow the same instructions for Implementation as in §3.1.
Report (25%): In addition to the analyses run for the Report for the milestone deadline (§3.1), you must
also run the following analyses:
1. For the “small” data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold
rthresh ∈ {0.95, 0.925, 0.9}, compute the clustering coefficient, γ(Gr), and the characteristic path
length, L(Gr), of the graph Gr.
2. Compute the clustering coefficient, γ(Grandom), and the characteristic path length, L(Grandom),
of a random graph Grandom corresponding to each Gr.
(a) Compare γ(Gr) and L(Gr) to γ(Grandom) and L(Grandom) for the random graph, Grandom,
of comparable size.
(b) What conclusion(s) can you draw from your results?
Correctness (30%): The same instructions for Correctness as in §3.1 apply.
Bonus (10%): Repeat the analyses on the full data set. (I know that this is a lot of extra work for a bonus;
it is a good indicator of the scalability of your code.)
Acknowledgements
Thanks to my former student Wayne S. Chan who motivated this project.
References
[1] J. P. Onnela, K. Kaski, and J. Kert´esz. Clustering and information in correlation based financial networks.
European Physical Journal B, 38(2):353–362, March 2004.
[2] J. T. Overpeck, M. Sturm, J. A. Francis, D. K. Perovich, M. C. Serreze, R. Benner, E. C. Carmack, F. S. Chapin
III, S. C. Gerlach, L. C. Hamilton, L. D. Hinzman, M. Holland, H. P. Huntington, J. R. Key, A. H. Lloyd, G. M.
MacDonald, J. McFadden, D. Noone, T. D. Prowse, P. Schlosser, and C. Vorosmarty. Arctic system on trajectory
to new, seasonally ice-free state. Earth Observation Science, 86(34):309–316, August 2005.
[3] A. A. Tsonis, K. L. Swanson, and P. J. Roebber. What do networks have to do with climate? Bulletin of the
American Meteorological Society, pages 585–595, May 2006.
[4] M. Tumminello, D. Matteo, T. Aste, and R. N. Mantegna. Correlation based networks of equity returns sampled
at different time horizons. European Physical Journal B, 55:209–217, 2007.
[5] D. J. Watts and S. H. Strogatz. Collective dynamics of “small-world” networks. Nature, 393:440–442, 1998.
2Do not use any other archiving program except zip.
6