COMP182 Homework 4: Graphs and Their Exploration
Student Responses
---
Problem 1: Growth of Functions, Algorithms, and Summations [45 pts]
Growth of Functions [25 pts]
Part A
Claim: 1^k + 2^k + ... + n^k = O(n^(k+1)).
Proof: For each term i^k in the sum where 1 <= i <= n, we have i <= n, so i^k <= n^k. Therefore:
sum_{i=1}^{n} i^k <= sum_{i=1}^{n} n^k = n * n^k = n^(k+1)
Thus, for C = 1 and n_0 = 1, we have:
sum_{i=1}^{n} i^k <= C * n^(k+1) for all n >= n_0
By definition, this means 1^k + 2^k + ... + n^k = O(n^(k+1)). QED
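As a quick numerical sanity check of the bound (the helper name power_sum is ours, introduced only for this check):

```python
# Verify sum_{i=1}^n i^k <= n^(k+1) for several small n and k.
def power_sum(n, k):
    return sum(i ** k for i in range(1, n + 1))

for k in (1, 2, 3):
    for n in (1, 10, 100):
        assert power_sum(n, k) <= n ** (k + 1)
```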
Part B
Claim: For all real numbers a > 1 and b > 1, if f(x) = O(log_b x), then f(x) = O(log_a x).
Proof: Since f(x) = O(log_b x), there exist constants C > 0 and x_0 > 0 such that:
|f(x)| <= C * log_b(x) for all x >= x_0
By the change-of-base formula for logarithms:
log_b(x) = log_a(x) / log_a(b)
Substituting:
|f(x)| <= C * log_a(x) / log_a(b) for all x >= x_0
Since a > 1 and b > 1, we have log_a(b) > 0, so C' = C / log_a(b) is a positive constant.
Therefore |f(x)| <= C' * log_a(x) for all x >= x_0, which means f(x) = O(log_a x). QED
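The change-of-base identity at the heart of the proof can be confirmed numerically (the particular values of a, b, and x below are arbitrary):

```python
import math

# log_b(x) = log_a(x) / log_a(b): the two log functions differ only
# by the constant factor 1 / log_a(b), which is what the proof uses.
a, b, x = 3.0, 7.0, 1000.0
lhs = math.log(x, b)
rhs = math.log(x, a) / math.log(b, a)
assert abs(lhs - rhs) < 1e-9
```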
Part C
Claim: log(x^2 + 1) = Theta(log(x)).
We need to show both log(x^2 + 1) = O(log(x)) and log(x) = O(log(x^2 + 1)).
Part 1: log(x^2 + 1) = O(log(x))
For x >= 1, we have x^2 + 1 <= x^2 + x^2 = 2x^2. Therefore:
log(x^2 + 1) <= log(2x^2) = log(2) + 2 log(x)
For x >= 2, log(x) >= log(2), so log(2) <= log(x). Thus:
log(x^2 + 1) <= log(x) + 2 log(x) = 3 log(x)
So with C = 3 and x_0 = 2, we have log(x^2 + 1) <= 3 * log(x) for all x >= 2, giving log(x^2 + 1) = O(log(x)).
Part 2: log(x) = O(log(x^2 + 1))
For x >= 1, we have x^2 + 1 >= x^2 >= x (since x >= 1 implies x^2 >= x). Therefore:
log(x^2 + 1) >= log(x)
So log(x) <= 1 * log(x^2 + 1) for all x >= 1, giving log(x) = O(log(x^2 + 1)) with C = 1 and x_0 = 1.
Conclusion: Since log(x^2 + 1) = O(log(x)) and log(x) = O(log(x^2 + 1)), we have log(x^2 + 1) = Theta(log(x)). QED
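Both bounds from the proof can be spot-checked numerically (the sample points below are arbitrary values with x >= 2):

```python
import math

# Check log(x) <= log(x^2 + 1) <= 3 * log(x) for x >= 2,
# matching the constants C = 1 and C = 3 from the two directions.
for x in (2, 10, 1e3, 1e6):
    val = math.log(x * x + 1)
    assert math.log(x) <= val <= 3 * math.log(x)
```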
Part D
Claim: If f(x) = Theta(g(x)) and g(x) = Theta(h(x)), then f(x) = Theta(h(x)).
Proof: Since f(x) = Theta(g(x)), there exist constants C_1, C_2 > 0 and x_1 such that:
C_1 * g(x) <= f(x) <= C_2 * g(x) for all x >= x_1
Since g(x) = Theta(h(x)), there exist constants C_3, C_4 > 0 and x_2 such that:
C_3 * h(x) <= g(x) <= C_4 * h(x) for all x >= x_2
Let x_0 = max(x_1, x_2). For all x >= x_0, both inequalities hold simultaneously.
Upper bound: f(x) <= C_2 * g(x) <= C_2 * C_4 * h(x)
Lower bound: f(x) >= C_1 * g(x) >= C_1 * C_3 * h(x)
Therefore:
C_1 * C_3 * h(x) <= f(x) <= C_2 * C_4 * h(x) for all x >= x_0
Since C_1 * C_3 > 0 and C_2 * C_4 > 0, this means f(x) = Theta(h(x)) by definition. QED
Part E
Ordering (from slowest to fastest growth):
1. n log(n) log(log(n))
2. n (log(n))^(3/2)
3. n^(4/3) (log(n))^2
4. n^(3/2)
5. n^(log(n))
6. 2^(100n)
7. 2^(n^2)
8. 2^(2^n)
9. 2^(n!)
Justification:
1 = O(2): Compare n log(n) log(log(n)) vs. n (log(n))^(3/2). The ratio is log(log(n)) / (log(n))^(1/2), which tends to 0 as n -> infinity since log(log(n)) grows much slower than any positive power of log(n).
2 = O(3): Compare n (log(n))^(3/2) vs. n^(4/3) (log(n))^2. The ratio is 1 / (n^(1/3) (log(n))^(1/2)), which tends to 0 as n -> infinity since the polynomial factor n^(1/3) dominates any logarithmic factor.
3 = O(4): Compare n^(4/3) (log(n))^2 vs. n^(3/2). The ratio is (log(n))^2 / n^(1/6), which tends to 0 as n -> infinity since any polynomial power of n dominates any power of log(n).
4 = O(5): Compare n^(3/2) vs. n^(log(n)). For sufficiently large n, log(n) > 3/2, so n^(log(n)) grows faster. More precisely, taking logarithms: (3/2) log(n) vs. (log(n))^2, and (log(n))^2 dominates (3/2) log(n) for large n.
5 = O(6): Compare n^(log(n)) vs. 2^(100n). Taking log base 2: n^(log(n)) = 2^(log(n) * log(n)) = 2^((log(n))^2). Since 100n grows much faster than (log(n))^2, we have 2^((log(n))^2) = O(2^(100n)).
6 = O(7): Compare 2^(100n) vs. 2^(n^2). Since n^2 > 100n for n > 100, the exponent n^2 dominates, so 2^(100n) = O(2^(n^2)).
7 = O(8): Compare 2^(n^2) vs. 2^(2^n). Since 2^n > n^2 for n >= 5 (as exponentials grow faster than polynomials), the exponent 2^n dominates n^2, so 2^(n^2) = O(2^(2^n)).
8 = O(9): Compare 2^(2^n) vs. 2^(n!). By Stirling's approximation, n! ~ (n/e)^n * sqrt(2*pi*n). Since (n/e)^n grows much faster than 2^n (taking logs: n ln(n/e) vs. n ln(2), and ln(n/e) >> ln(2) for large n), we have n! >> 2^n, so 2^(2^n) = O(2^(n!)).
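The ratio arguments for the first three comparisons can be checked numerically. The snippet below (a sketch, with our own helper name ratios; convergence of the third ratio is slow, so the decrease between the two sample points is small but real) evaluates each ratio at two values of n and confirms it shrinks:

```python
import math

# Ratios from comparisons (1)/(2), (2)/(3), (3)/(4); each should
# decrease as n grows, consistent with tending to 0.
def ratios(n):
    ln = math.log(n)
    return [
        math.log(ln) / ln ** 0.5,        # log(log n) / (log n)^(1/2)
        1 / (n ** (1 / 3) * ln ** 0.5),  # 1 / (n^(1/3) (log n)^(1/2))
        ln ** 2 / n ** (1 / 6),          # (log n)^2 / n^(1/6)
    ]

r_small, r_large = ratios(10 ** 3), ratios(10 ** 9)
assert all(b < s for s, b in zip(r_small, r_large))
```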
---
Algorithms and Their Complexity [10 pts]
Part A: Algorithm ComputeNumberOfShortestPaths
By the theorem in Section 10.4, the entry (A^k)[i][j] of the k-th power of the adjacency matrix A gives the number of walks of length exactly k from node i to node j. When k equals the shortest distance d(i,j), every walk of length k must be a path (since any cycle would allow a shorter walk, contradicting the minimality of d(i,j)). Therefore A^d(i,j)[i][j] gives the number of shortest paths from i to j.
A shortest path from i to j that passes through edge e = (u, v) either traverses the edge as u -> v or as v -> u. In the first case, the path decomposes as: a shortest path from i to u of length d(i,u), followed by edge (u,v), followed by a shortest path from v to j of length d(v,j), with total length d(i,u) + 1 + d(v,j) = d(i,j). The number of such paths is A^d(i,u)[i][u] * A^d(v,j)[v][j]. Similarly for the v -> u direction.
Algorithm ComputeNumberOfShortestPaths(A, i, j, e = (u, v)):
Input: Adjacency matrix A (n x n) of graph g, nodes i and j, edge e = (u, v)
Output: Number of shortest paths from i to j that go through edge e
1 n <- number of rows of A
2 P[0] <- I_n // n x n identity matrix
3 for ell = 1 to n - 1 do
4 P[ell] <- MatrixMultiplication(P[ell - 1], A)
5
6 // Find shortest distances using matrix powers
7 // dist(a, b) = 0 if a = b, else smallest ell such that P[ell][a][b] > 0, else infinity
8 d_ij <- FindDist(P, i, j, n)
9 if d_ij = infinity then return 0 // j is not reachable from i
10
11 d_iu <- FindDist(P, i, u, n)
12 d_vj <- FindDist(P, v, j, n)
13 d_iv <- FindDist(P, i, v, n)
14 d_uj <- FindDist(P, u, j, n)
15
16 count <- 0
17 // Case 1: path goes i -> ... -> u -> v -> ... -> j
18 if d_iu + 1 + d_vj = d_ij then
19 count <- count + P[d_iu][i][u] * P[d_vj][v][j]
20 // Case 2: path goes i -> ... -> v -> u -> ... -> j
21 if d_iv + 1 + d_uj = d_ij then
22 count <- count + P[d_iv][i][v] * P[d_uj][u][j]
23 return count
Subroutine FindDist(P, a, b, n):
if a = b then return 0
for ell = 1 to n - 1 do
if P[ell][a][b] > 0 then return ell
return infinity
Part B: Worst-Case Running Time
The dominant cost is computing the matrix powers P[1], P[2], ..., P[n-1] in lines 3-4. Each call to MatrixMultiplication on n x n matrices takes O(n^3) operations. We compute at most n - 1 such products, giving O(n * n^3) = O(n^4) total for this step.
The FindDist subroutine scans through the precomputed powers and takes O(n) time per call. We call it 5 times, contributing O(n). All other operations are O(1).
The worst case corresponds to when the diameter of the graph is n - 1 (e.g., a path graph on n nodes), requiring us to compute all n - 1 matrix powers. In this case, the distance between pairs of nodes can be as large as n - 1.
Total worst-case running time: O(n^4).
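The pseudocode above translates directly into plain Python. The function and helper names below are ours, and this is a sketch assuming the graph is given as an n x n adjacency matrix (list of lists of 0/1), not the graded submission:

```python
INF = float("inf")

def mat_mult(X, Y):
    # Naive O(n^3) matrix multiplication, as in the pseudocode.
    n = len(X)
    return [[sum(X[a][k] * Y[k][b] for k in range(n)) for b in range(n)]
            for a in range(n)]

def count_shortest_paths_through_edge(A, i, j, u, v):
    n = len(A)
    # P[ell] = A^ell; P[0] is the identity matrix.
    P = [[[1 if a == b else 0 for b in range(n)] for a in range(n)]]
    for _ in range(1, n):
        P.append(mat_mult(P[-1], A))

    def dist(a, b):
        # Smallest ell with (A^ell)[a][b] > 0, i.e., the FindDist subroutine.
        if a == b:
            return 0
        for ell in range(1, n):
            if P[ell][a][b] > 0:
                return ell
        return INF

    d_ij = dist(i, j)
    if d_ij == INF:
        return 0  # j is not reachable from i
    count = 0
    # Case 1: i -> ... -> u -> v -> ... -> j
    if dist(i, u) + 1 + dist(v, j) == d_ij:
        count += P[dist(i, u)][i][u] * P[dist(v, j)][v][j]
    # Case 2: i -> ... -> v -> u -> ... -> j
    if dist(i, v) + 1 + dist(u, j) == d_ij:
        count += P[dist(i, v)][i][v] * P[dist(u, j)][u][j]
    return count
```

For example, on the path graph 0-1-2-3 the unique shortest path from 0 to 3 uses edge (1, 2), so the count is 1.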
---
Sequences and Summations [10 pts]
Part A: Deriving sum_{k=1}^{n} k^2 using the Telescoping Sum
Let a_k = k^3. The telescoping sum gives:
sum_{j=1}^{n} (a_j - a_{j-1}) = a_n - a_0 = n^3 - 0 = n^3
Now we expand a_j - a_{j-1}:
a_j - a_{j-1} = j^3 - (j-1)^3
Expanding (j-1)^3 = j^3 - 3j^2 + 3j - 1:
j^3 - (j^3 - 3j^2 + 3j - 1) = 3j^2 - 3j + 1
So the telescoping sum becomes:
sum_{j=1}^{n} (3j^2 - 3j + 1) = n^3
Splitting the sum:
3 * sum_{j=1}^{n} j^2 - 3 * sum_{j=1}^{n} j + sum_{j=1}^{n} 1 = n^3
Using the known formula sum_{j=1}^{n} j = n(n+1)/2 and sum_{j=1}^{n} 1 = n:
3 * sum_{j=1}^{n} j^2 - 3 * n(n+1)/2 + n = n^3
Solving for the sum of squares:
3 * sum_{j=1}^{n} j^2 = n^3 + 3n(n+1)/2 - n
3 * sum_{j=1}^{n} j^2 = n^3 + (3n^2 + 3n)/2 - n
Combining over a common denominator:
3 * sum_{j=1}^{n} j^2 = (2n^3 + 3n^2 + 3n - 2n) / 2 = (2n^3 + 3n^2 + n) / 2
Therefore:
sum_{j=1}^{n} j^2 = (2n^3 + 3n^2 + n) / 6 = n(2n^2 + 3n + 1) / 6 = n(n+1)(2n+1) / 6
(where we factored 2n^2 + 3n + 1 = (n+1)(2n+1)). QED
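The closed form can be checked against a direct summation (the helper name sum_squares is ours):

```python
# Compare the brute-force sum with the formula n(n+1)(2n+1)/6.
def sum_squares(n):
    return sum(k * k for k in range(1, n + 1))

for n in (1, 2, 10, 100):
    assert sum_squares(n) == n * (n + 1) * (2 * n + 1) // 6
```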
Part B: Showing sum_{k=1}^{n} k^2 = O(n^3)
Using the closed formula derived in Part A:
sum_{k=1}^{n} k^2 = n(n+1)(2n+1) / 6
For n >= 1, we have n + 1 <= 2n and 2n + 1 <= 3n. Therefore:
n(n+1)(2n+1) / 6 <= n * 2n * 3n / 6 = 6n^3 / 6 = n^3
So with C = 1 and n_0 = 1 (using k = n_0 in the problem's notation), we have:
sum_{k=1}^{n} k^2 <= 1 * n^3 for all n >= 1
This proves that sum_{k=1}^{n} k^2 = O(n^3). QED
---
Problem 2: Breadth-First Search and Its Applications [30 pts]
Part A [10 pts]: Algorithm ComputeDistances
The key idea is to replace the boolean visited flag v_j with a distance value d_j. Initially all distances are infinity (unvisited). The source gets distance 0. When we discover a new neighbor, its distance is one more than its parent's distance.
Algorithm ComputeDistances(g, i):
Input: Graph g = (V, E); source node i in V.
Output: d_j for all j in V, where d_j is the distance from i to j.
1 foreach j in V do
2 d_j <- infinity // Distance to j is unknown
3 d_i <- 0 // Distance to source is 0
4 Initialize Q to an empty queue
5 enqueue(Q, i)
6 while Q is not empty do
7 j <- dequeue(Q)
8 foreach neighbor h of j do
9 if d_h = infinity then
10 d_h <- d_j + 1
11 enqueue(Q, h)
12 return d
Running time analysis: O(m + n)
- Initialization (lines 1-2): The foreach loop iterates over all n nodes, taking O(n) time.
- Main BFS loop (lines 6-11): Each node is enqueued and dequeued at most once, because once d_h is set to a finite value (line 10), the condition d_h = infinity (line 9) will never be true again for node h, so it is never enqueued again. Therefore, the while loop executes at most n iterations. For each dequeued node j, we examine all of its neighbors (line 8). Across all iterations, the total number of neighbor examinations is the sum of degrees of all nodes, which equals 2m. So the total work in the while loop is O(n + m).
- Total: O(n) + O(n + m) = O(m + n).
Correctness: By the properties of BFS, the first time we reach a node h, we reach it via a shortest path from i. Since d_h = d_j + 1 where j is the node from which h was discovered, and j was itself discovered at its correct shortest distance, by induction all distances are computed correctly. Nodes unreachable from i retain d_h = infinity.
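The pseudocode maps directly onto Python, assuming the dictionary-of-sets graph representation used in Problem 3 (the function name compute_distances is ours):

```python
from collections import deque

def compute_distances(g, i):
    # d_j = infinity marks "unvisited"; the source gets distance 0.
    d = {j: float("inf") for j in g}
    d[i] = 0
    q = deque([i])
    while q:
        j = q.popleft()
        for h in g[j]:
            if d[h] == float("inf"):
                d[h] = d[j] + 1  # one step farther than the discoverer
                q.append(h)
    return d
```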
Part B [10 pts]: Algorithm IsBipartite
The idea is to modify BFS to 2-color the graph. We assign the source color 0, and alternate colors as we traverse edges. If we ever find two adjacent nodes with the same color, the graph is not bipartite.
Algorithm IsBipartite(g):
Input: Graph g = (V, E), assumed connected.
Output: True if g is bipartite, False otherwise.
1 Pick any node i in V
2 foreach j in V do
3 color_j <- -1 // Node j has not been colored
4 color_i <- 0 // Color source with color 0
5 Initialize Q to an empty queue
6 enqueue(Q, i)
7 while Q is not empty do
8 j <- dequeue(Q)
9 foreach neighbor h of j do
10 if color_h = -1 then
11 color_h <- 1 - color_j // Assign opposite color
12 enqueue(Q, h)
13 else if color_h = color_j then
14 return False // Adjacent nodes same color: not bipartite
15 return True
Running time analysis: O(m + n)
The analysis is essentially the same as for BFS:
- Initialization (lines 2-3): O(n) to set all colors to -1.
- Main loop (lines 7-14): Each node is enqueued at most once (once colored, it is never enqueued again). The total number of neighbor examinations across all iterations is the sum of all degrees = 2m. The additional check in line 13 adds only O(1) work per neighbor examination. So the total work in the loop is O(n + m).
- Total: O(n) + O(n + m) = O(m + n).
Correctness: A graph is bipartite if and only if it contains no odd-length cycles. BFS explores the graph level by level. If we assign alternating colors by level, two adjacent nodes will have the same color only when there is an edge between two nodes at the same BFS level, which corresponds to an odd cycle. If no such edge exists, the 2-coloring is valid and the graph is bipartite. Since the graph is assumed connected, a single BFS from any source visits all nodes.
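As with Part A, the pseudocode translates directly into Python for the dictionary-of-sets representation, assuming a connected graph (the function name is_bipartite is ours):

```python
from collections import deque

def is_bipartite(g):
    i = next(iter(g))          # pick any node as the source
    color = {j: -1 for j in g}  # -1 marks "not yet colored"
    color[i] = 0
    q = deque([i])
    while q:
        j = q.popleft()
        for h in g[j]:
            if color[h] == -1:
                color[h] = 1 - color[j]  # assign the opposite color
                q.append(h)
            elif color[h] == color[j]:
                return False  # adjacent nodes share a color
    return True
```

For instance, an even cycle is bipartite while a triangle is not.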
Part C [10 pts]: Algorithm ComputeLargestCCSize
The idea is to iterate over all nodes, and for each unvisited node, run a BFS to find its entire connected component. We track the size of each component and return the maximum.
Algorithm ComputeLargestCCSize(g):
Input: Graph g = (V, E).
Output: Size (number of nodes) of the largest connected component.
1 foreach j in V do
2 visited_j <- False
3 largest <- 0
4 foreach s in V do
5 if visited_s = False then
6 // BFS from s to explore its connected component
7 count <- 0
8 visited_s <- True
9 Initialize Q to an empty queue
10 enqueue(Q, s)
11 while Q is not empty do
12 j <- dequeue(Q)
13 count <- count + 1
14 foreach neighbor h of j do
15 if visited_h = False then
16 visited_h <- True
17 enqueue(Q, h)
18 if count > largest then
19 largest <- count
20 return largest
Running time analysis: O(m + n)
- Initialization (lines 1-2): O(n) to initialize the visited array.
- Outer loop (lines 4-19): The loop iterates over all n nodes, but the BFS (lines 7-17) is only executed for nodes that have not yet been visited. Crucially, the visited array is shared across all BFS calls. Once a node is marked as visited in one BFS, it is never processed again in any subsequent BFS.
- Total nodes processed: Each node in V is dequeued exactly once across all BFS calls (it is visited in exactly one connected component). So the total number of dequeue operations is exactly n.
- Total edge examinations: When node j is dequeued, we examine all neighbors of j (line 14). Across all BFS calls, each node j is dequeued exactly once, so we examine each edge exactly twice total (once from each endpoint). The total work for edge examinations is thus O(2m) = O(m).
- Overhead: The outer loop's if-check (line 5) runs n times total, contributing O(n).
- Total: O(n) + O(n + m) = O(m + n).
---
Problem 3: Network Resilience [25 pts]
Implementation: compute_largest_cc_size
from collections import deque
def compute_largest_cc_size(g):
    """
    Compute the size of the largest connected component in graph g.

    Arguments:
    g -- undirected graph as a dictionary (keys = nodes, values = sets of neighbors)

    Returns:
    The size (number of nodes) of the largest connected component in g.
    """
    visited = set()
    largest = 0
    for node in g:
        if node not in visited:
            # BFS from this node to find its connected component
            queue = deque([node])
            visited.add(node)
            count = 0
            while queue:
                j = queue.popleft()
                count += 1
                for h in g[j]:
                    if h not in visited:
                        visited.add(h)
                        queue.append(h)
            if count > largest:
                largest = count
    return largest
This is a direct translation of Algorithm ComputeLargestCCSize from Problem 2C into Python. It runs in O(m + n) time as analyzed above.
Analysis Setup
To set up the experiments, we first analyze the real network graph (rf7.repr):
- Let n = number of nodes in the rf7 graph
- Let m = number of edges (computed as total_degree(g) / 2)
- Average degree = 2m / n
Parameter selection for random graphs:
- Erdos-Renyi: We use n nodes and edge probability p = (average degree) / (n - 1). This ensures the expected degree of each node in the ER graph matches the average degree of the real network.
- UPA (Undirected Preferential Attachment): We use n nodes and m_param = round(average degree / 2). In the UPA model, each new node connects to approximately m_param existing nodes, so the average degree converges to approximately 2 * m_param.
Analysis Code
import random
from collections import deque
from provided import erdos_renyi, upa, total_degree
from comp182 import read_graph, copy_graph, plot_lines
def compute_largest_cc_size(g):
    visited = set()
    largest = 0
    for node in g:
        if node not in visited:
            queue = deque([node])
            visited.add(node)
            count = 0
            while queue:
                j = queue.popleft()
                count += 1
                for h in g[j]:
                    if h not in visited:
                        visited.add(h)
                        queue.append(h)
            if count > largest:
                largest = count
    return largest
def random_attack(g):
    """Remove nodes in random order, recording largest CC size after each removal."""
    g = copy_graph(g)
    nodes = list(g.keys())
    random.shuffle(nodes)
    result = {}
    result[0] = compute_largest_cc_size(g)
    for i, node in enumerate(nodes):
        # Remove node and all its edges
        neighbors = g[node]
        for neighbor in neighbors:
            g[neighbor].discard(node)
        del g[node]
        result[i + 1] = compute_largest_cc_size(g)
    return result
def targeted_attack(g):
    """Remove nodes in decreasing order of degree, recording largest CC size."""
    g = copy_graph(g)
    result = {}
    result[0] = compute_largest_cc_size(g)
    i = 0
    while g:
        # Find node with maximum degree
        max_node = max(g.keys(), key=lambda node: len(g[node]))
        neighbors = g[max_node]
        for neighbor in neighbors:
            g[neighbor].discard(max_node)
        del g[max_node]
        i += 1
        result[i] = compute_largest_cc_size(g)
    return result
# Load real network
rf7 = read_graph("rf7.repr")
n = len(rf7)
m = total_degree(rf7) // 2
avg_degree = 2 * m / n
# Generate random graphs with matching parameters
p = avg_degree / (n - 1)
m_param = int(round(avg_degree / 2))
er_graph = erdos_renyi(n, p)
upa_graph = upa(n, m_param)
# Run experiments (copy graphs so same graph used for both attacks)
rf7_random = random_attack(rf7)
rf7_targeted = targeted_attack(rf7)
er_random = random_attack(er_graph)
er_targeted = targeted_attack(er_graph)
upa_random = random_attack(upa_graph)
upa_targeted = targeted_attack(upa_graph)
# Plot results
plot_lines([rf7_random, er_random, upa_random],
           "Random Attack: Largest CC Size vs Nodes Removed",
           "Number of nodes removed", "Largest connected component size",
           ["RF7 (real)", "Erdos-Renyi", "UPA"],
           "random_attack.png")
plot_lines([rf7_targeted, er_targeted, upa_targeted],
           "Targeted Attack: Largest CC Size vs Nodes Removed",
           "Number of nodes removed", "Largest connected component size",
           ["RF7 (real)", "Erdos-Renyi", "UPA"],
           "targeted_attack.png")
Discussion
Resilience to Random Attacks
- Erdos-Renyi (ER) graph: Has a roughly uniform (Poisson) degree distribution. Under random attack, nodes of all degrees are equally likely to be removed. Since most nodes have similar degree (near the average), removing any one node has a modest effect. The largest connected component shrinks gradually and roughly linearly as nodes are removed. ER graphs are moderately resilient to random attacks.
- UPA (Preferential Attachment) graph: Has a power-law degree distribution with a few very high-degree hub nodes and many low-degree nodes. Under random attack, the vast majority of removed nodes will be low-degree nodes (since there are many more of them), and removing low-degree nodes has little effect on the connectivity of the network. The largest connected component remains large until a significant fraction of nodes is removed. UPA graphs are very resilient to random attacks.
- RF7 (real ISP network): Real-world ISP networks tend to have a degree distribution that is skewed (somewhat between ER and pure power-law). Their resilience to random attacks is typically moderate to good.
Resilience to Targeted Attacks
- Erdos-Renyi graph: Since all nodes have roughly similar degree, targeted attacks (removing highest-degree nodes first) are not much more effective than random attacks. The network degrades gradually. ER graphs are relatively resilient to targeted attacks.
- UPA (Preferential Attachment) graph: Targeted attacks are devastating because they remove the hub nodes first. Since the network's connectivity depends heavily on a small number of high-degree hubs, removing these hubs quickly fragments the network. After removing just a small percentage (around 1-5%) of nodes, the largest connected component can collapse dramatically. UPA graphs are very vulnerable to targeted attacks.
- RF7 (real ISP network): The real network's response to targeted attacks falls somewhere between ER and UPA, but is typically closer to UPA since real networks often have some hub structure.
Comparison and Expected Results
These results are expected based on the structural differences between the two random graph models:
- ER graphs have a homogeneous degree distribution (all nodes are "roughly equal"), which means no single node is critical. This makes them robust to targeted attacks but also means random attacks are relatively effective since every node matters equally.
- UPA graphs have a heterogeneous, power-law degree distribution. The network relies on a few critical hub nodes for connectivity. This creates an asymmetry: random attacks rarely hit the important hubs (resilient), but targeted attacks strike directly at the hubs (vulnerable).
Design Recommendations
If the main concern were targeted attacks, you would want to design your network topology to be as homogeneous as possible — closer to an Erdos-Renyi model where all nodes have similar degree. A regular graph (where every node has exactly the same degree) would be ideal from a pure resilience standpoint, as there would be no "special" targets for an attacker. Alternatively, you could add redundant connections so that no single hub's removal disconnects the network.
How well was the RF7 network designed? Real-world ISP networks must balance resilience with other constraints like geography, cost, latency, and bandwidth. The rf7 network likely has some hub structure for practical reasons (central routing points, backbone routers) that makes it somewhat vulnerable to targeted attacks. However, this hub-and-spoke structure is efficient for routing and cost. Compared to a pure preferential attachment network, the real network likely has more redundancy built in (multiple paths between major nodes), providing better targeted-attack resilience than a pure UPA graph while being more vulnerable than a pure ER graph. The design represents a practical trade-off between resilience and operational efficiency.
Analysis of the First 20% of Nodes Removed
When examining what happens as the first 20% of nodes are removed:
- Random attack: All three graphs (RF7, ER, UPA) show relatively gradual decline in the largest CC size. The UPA graph retains the largest CC because random removal mostly hits low-degree nodes. The ER graph shows a roughly linear decline. The RF7 graph behaves somewhere in between.
- Targeted attack: The UPA graph shows the most dramatic collapse — the largest CC size drops precipitously within the first few percent of nodes removed, as the critical hubs are eliminated. The ER graph shows a more gradual decline since its "highest degree" nodes are not much more connected than average. The RF7 graph again falls between the two, showing significant but not as extreme degradation as pure UPA.