Agent Work: HW6: Divide and Conquer
Claude Sonnet 4.6 · COMP 182: Algorithmic Thinking
COMP 182 Homework 6: Rooted Directed Minimum Spanning Trees
Overview
In this homework, you will learn about minimum spanning trees for weighted, directed graphs and apply this to analyze disease transmission patterns.
---
Problem 1: Proofs (Written)
Part A: Prove Lemma 1
Lemma 1: Let g = (V, E, w) be a weighted, directed graph with designated root r ∈ V. Let
E' = {me(u) : u ∈ (V \ {r})}
where me(u) is an edge whose head is node u and whose weight is m(u) (the minimum weight of an edge whose head is u).
Then, either T = (V, E') is an RDMST of g rooted at r or T contains a cycle.
Task: Complete the proof of this lemma.
Part B: Prove Lemma 2
Lemma 2: Let g = (V, E, w) be a weighted, directed graph with designated root r ∈ V. Consider the weight function w' : E → R+ defined as follows for each edge e = (u, v):
w'(e) = w(e) - m(v)
Then, T = (V, E') is an RDMST of g = (V, E, w) rooted at r if and only if T is an RDMST of g = (V, E, w') rooted at r.
Task: Complete the proof of this lemma.
---
Problem 2: Implementing the RDMST Algorithm (Python)
Implement the following functions that are used by the compute_rdmst algorithm. The main algorithm is provided to you.
Definitions
- RDST (Rooted Directed Spanning Tree): A subgraph T = (V, E') of a directed graph g such that:
1. If we ignore edge directions, T is a spanning tree of g 2. There is a directed path from root r to every node v ∈ V \ {r}
- RDMST (Rooted Directed Minimum Spanning Tree): An RDST with the smallest total edge weight
Graph Representation
Standard representation: Dictionary where each key i maps to a dictionary of nodes reachable from i with edge weights.
# Example: g = {0: {1: 20, 2: 4, 3: 20}, 1: {2: 2, 5: 16}, ...}
# Edge from 0 to 1 has weight 20Reversed representation: Dictionary where each key i maps to a dictionary of nodes that have edges TO i.
# Example: g = {0: {}, 1: {0: 20, 4: 4}, 2: {0: 4, 1: 2}, ...}
# Edges to node 1 come from nodes 0 (weight 20) and 4 (weight 4)Functions to Implement
1. reverse_digraph_representation(graph)
Convert between standard and reversed representations.
Arguments:
graph-- a weighted digraph in dictionary form
Returns:
- An identical digraph with the opposite representation
2. modify_edge_weights(rgraph, root)
Modify edge weights according to Lemma 2 (Step 1 of algorithm). For each node (except root), subtract the minimum incoming edge weight from all incoming edges.
Arguments:
rgraph-- a weighted digraph in reversed representationroot-- a node in the graph
Returns:
- The modified graph (in-place modification)
3. compute_rdst_candidate(rgraph, root)
Compute the set E' of edges based on Lemma 1 (Step 2 of algorithm). For each non-root node, select one edge with weight 0.
Arguments:
rgraph-- a weighted digraph in reversed representation (after weight modification)root-- a node in the graph
Returns:
- A weighted digraph (reversed representation) containing the selected edges
4. compute_cycle(rgraph)
Find a cycle in the RDST candidate graph.
Arguments:
rgraph-- a weighted digraph in reversed representation where every node has in-degree 0 or 1
Returns:
- A tuple of nodes forming a cycle, or None if no cycle exists
5. contract_cycle(graph, cycle)
Contract a cycle into a single node (Step 4a of algorithm).
Arguments:
graph-- a weighted digraph in standard representationcycle-- a tuple of nodes on the cycle
Returns:
- A tuple (graph, cstar) where graph is the contracted graph and cstar is the new node label
6. expand_graph(rgraph, rdst_candidate, cycle, cstar)
Expand a contracted cycle back to the original nodes (Step 4c of algorithm).
Arguments:
rgraph-- the original digraph (reversed representation) before contractionrdst_candidate-- the RDMST computed on the contracted graphcycle-- the cycle that was contractedcstar-- the node label used for the contracted cycle
Returns:
- A weighted digraph with the cycle expanded back
---
Problem 3: Bacterial Infection Transmission Analysis
Background
In 2011, an outbreak of antibiotic-resistant *Klebsiella pneumoniae* infected 18 patients at a hospital, causing 6 deaths. Researchers used genetic and epidemiological data to trace the transmission.
Task
Implement functions to: 1. Compute genetic (Hamming) distance between bacterial genomes 2. Construct a complete weighted digraph combining genetic and epidemiological data 3. Infer the transmission map using the RDMST algorithm
Functions to Implement
1. compute_genetic_distance(seq1, seq2)
Compute the Hamming distance between two sequences.
Arguments:
seq1-- a DNA sequence (encoded as 0's and 1's)seq2-- a DNA sequence (same length as seq1)
Returns:
- The number of positions where the sequences differ
Example:
compute_genetic_distance('00101', '10100') # Returns 22. construct_complete_weighted_digraph(gen_data, epi_data)
Build a complete weighted digraph using genetic and epidemiological data.
The weight of edge (A, B) is:
D_AB = G_AB + (999 * (E_AB / max(E))) / 10^5where G_AB is genetic distance and E_AB is epidemiological distance.
Arguments:
gen_data-- filename for genetic dataepi_data-- filename for epidemiological data
Returns:
- A complete weighted digraph
3. infer_transmap(gen_data, epi_data)
Infer the transmission map by computing the RDMST of the weighted digraph.
Arguments:
gen_data-- filename for genetic dataepi_data-- filename for epidemiological data
Returns:
- The RDMST representing the most likely transmission map
Data Files
You are provided with:
patient_sequences.txt-- genetic sequences for each patientpatient_traces.txt-- epidemiological data
Provided Functions (in provided.py)
read_patient_sequences(filename)-- reads genetic dataread_patient_traces(filename)-- reads epidemiological datacompute_pairwise_gen_distances(sequences, dist_func)-- computes pairwise genetic distances
Analysis Questions
After implementing the functions and inferring the transmission map:
1. Describe the transmission map you obtained 2. Who was Patient Zero (the source)? 3. Were there any surprising transmission paths? 4. How does this analysis help understand disease outbreaks?
---
Submission Requirements
1. Python code:
- autograder.py containing all required functions
- Properly documented with docstrings and comments
- Tested on your own examples
2. Written solutions: - Proofs for Lemma 1 and Lemma 2 - Analysis and discussion of the transmission map
---
Algorithm Reference: ComputeRDMST
Algorithm ComputeRDMST
Input: Weighted digraph g = (V, E, w) and node r ∈ V
Output: An RDMST T of g rooted at r
1. Modify edge weights according to Lemma 2
2. Compute RDST candidate T = (V, E') using Lemma 1
3. If T is an RDST, return it
4. Else (T contains a cycle C):
(a) Contract cycle C into single node c*
(b) Recursively compute RDMST of contracted graph
(c) Expand c* back to original cycle nodes
(d) Return expanded RDMST# COMP182 Homework 6
## Student Responses
<!-- Write your solutions below. Add sections for each problem as needed. -->
<!-- Use proper mathematical notation where applicable. -->
[Your solutions here]
COMP182 Homework 6
Student Responses
---
Problem 1: Proofs
Part A: Proof of Lemma 1
Lemma 1: Let g = (V, E, w) be a weighted, directed graph with designated root r ∈ V. Let
E' = {me(u) : u ∈ (V \ {r})}
where me(u) is an edge whose head is node u and whose weight is m(u) (the minimum weight of an edge whose head is u). Then, either T = (V, E') is an RDMST of g rooted at r, or T contains a cycle.
Proof:
By construction of E', every non-root node u ∈ V \ {r} has exactly one incoming edge in E', namely me(u). Thus:
- Every non-root node has in-degree exactly 1 in T.
- The root r has in-degree 0 in T (no edge in E' points to r, since we only chose edges for nodes in V \ {r}).
- |E'| = |V| - 1.
Case 1: T contains no cycle.
We claim T = (V, E') is an RDMST.
*Step 1: T is an RDST.*
We first show T is a spanning tree rooted at r (i.e., r can reach every node via directed paths in T).
Consider any node v ∈ V \ {r}. Starting at v, follow edges backward: v has exactly one incoming edge, so there is a unique "parent" node p(v) such that (p(v), v) ∈ E'. Repeating, we get a sequence v, p(v), p(p(v)), ... Since T has no directed cycle and the graph is finite, this backward path must terminate. The only node that can terminate the path is one with in-degree 0, which is r. Hence there is a directed path r → ··· → v in T for every v. So T is an RDST.
*Step 2: T is minimum weight.*
Let T* = (V, E*) be any RDST of g. Since T* is an RDST, every node u ∈ V \ {r} has exactly one incoming edge in E*. Let this edge be e*(u). By definition of m(u), we have:
w(e*(u)) ≥ m(u) = w(me(u)) for all u ∈ V \ {r}
Therefore:
w(T*) = Σ_{u ∈ V \ {r}} w(e*(u)) ≥ Σ_{u ∈ V \ {r}} m(u) = Σ_{u ∈ V \ {r}} w(me(u)) = w(T)
Since T* was an arbitrary RDST, T has minimum weight among all RDSTs. Thus T is an RDMST.
Case 2: T contains a cycle.
This is the other case stated in the lemma — T contains a cycle.
Conclusion: Either T is an RDMST (Case 1) or T contains a cycle (Case 2). This completes the proof. □
---
Part B: Proof of Lemma 2
Lemma 2: Let g = (V, E, w) be a weighted, directed graph with designated root r ∈ V. Consider the weight function w' : E → ℝ⁺ defined for each edge e = (u, v) as:
w'(e) = w(e) - m(v)
Then, T = (V, E') is an RDMST of g = (V, E, w) rooted at r if and only if T is an RDMST of g = (V, E, w') rooted at r.
Proof:
First, note that w'(e) = w(e) - m(v) ≥ 0 for all edges e = (u, v), since m(v) is the minimum incoming edge weight to v, so all incoming edge weights are at least m(v). Thus w' is a valid non-negative weight function.
Key Observation:
For any RDST T of g, every non-root node v ∈ V \ {r} has exactly one incoming edge in T. Therefore:
w'(T) = Σ_{(u,v) ∈ T} w'(u, v) = Σ_{(u,v) ∈ T} [w(u, v) - m(v)] = Σ_{(u,v) ∈ T} w(u, v) - Σ_{(u,v) ∈ T} m(v) = w(T) - Σ_{v ∈ V \ {r}} m(v)
The last equality holds because in any RDST, each non-root node v appears exactly once as the head of an edge, so m(v) is counted exactly once.
Let C = Σ_{v ∈ V \ {r}} m(v), which is a constant (independent of which RDST T we choose). Then:
w'(T) = w(T) - C for every RDST T of g.
Forward direction (⇒): Suppose T is an RDMST of g = (V, E, w). Then for every RDST T' of g:
w(T) ≤ w(T') ⟹ w(T) - C ≤ w(T') - C ⟹ w'(T) ≤ w'(T')
Since this holds for every RDST T', T is also an RDMST of g = (V, E, w').
Backward direction (⇐): Suppose T is an RDMST of g = (V, E, w'). Then for every RDST T' of g:
w'(T) ≤ w'(T') ⟹ w(T) - C ≤ w(T') - C ⟹ w(T) ≤ w(T')
Since this holds for every RDST T', T is also an RDMST of g = (V, E, w).
Conclusion: T is an RDMST of g = (V, E, w) if and only if T is an RDMST of g = (V, E, w'). □
---
Problem 2: Implementing the RDMST Algorithm
Function Implementations
1. reverse_digraph_representation(graph)
This function converts a standard directed graph representation (source → destinations) into a reversed representation (destination → sources).
def reverse_digraph_representation(graph):
"""
Converts a digraph from standard to reversed representation (or vice versa).
In standard rep: graph[u][v] = weight means edge u->v with given weight.
In reversed rep: graph[v][u] = weight means edge u->v with given weight.
Args:
graph: dict of dicts, weighted digraph
Returns:
dict of dicts, same graph with reversed representation
"""
reversed_graph = {node: {} for node in graph}
for u in graph:
for v, weight in graph[u].items():
if v not in reversed_graph:
reversed_graph[v] = {}
reversed_graph[v][u] = weight
return reversed_graphExplanation: For each edge (u, v, w) in the original graph, we add an entry reversed_graph[v][u] = w. We initialize all nodes in the reversed graph first to ensure even nodes with no incoming edges appear as keys.
2. modify_edge_weights(rgraph, root)
This function implements Step 1 of the RDMST algorithm (Lemma 2): subtract the minimum incoming edge weight from all incoming edges to each non-root node.
def modify_edge_weights(rgraph, root):
"""
Modifies edge weights: for each non-root node v, subtract m(v) from all
incoming edge weights (where m(v) = min incoming edge weight to v).
Args:
rgraph: reversed digraph representation (rgraph[v][u] = weight of edge u->v)
root: the root node
Returns:
modified rgraph (in-place)
"""
for v in rgraph:
if v == root:
continue
if not rgraph[v]:
continue
min_weight = min(rgraph[v].values())
for u in rgraph[v]:
rgraph[v][u] -= min_weight
return rgraphExplanation: For each non-root node v, find the minimum weight among all incoming edges (m(v)), then subtract m(v) from every incoming edge weight. After this, the minimum incoming edge weight to each non-root node is exactly 0.
3. compute_rdst_candidate(rgraph, root)
This function implements Step 2 of the RDMST algorithm (Lemma 1): for each non-root node, select one incoming edge with weight 0.
def compute_rdst_candidate(rgraph, root):
"""
Computes the RDST candidate by selecting one weight-0 incoming edge per
non-root node.
Args:
rgraph: reversed digraph (after weight modification, so min incoming = 0)
root: the root node
Returns:
reversed digraph containing only the selected edges
"""
candidate = {node: {} for node in rgraph}
for v in rgraph:
if v == root:
continue
# Find a predecessor with edge weight 0
for u, weight in rgraph[v].items():
if weight == 0:
candidate[v][u] = 0
break # Pick exactly one
return candidateExplanation: After weight modification, every non-root node has at least one incoming edge with weight 0 (since we subtracted the minimum). We select exactly one such edge per node. This gives us E' as defined in Lemma 1.
4. compute_cycle(rgraph)
This function finds a directed cycle in the RDST candidate graph, where every node has in-degree 0 or 1.
def compute_cycle(rgraph):
"""
Finds a directed cycle in the RDST candidate graph.
Each non-root node has at most one incoming edge, so cycles are simple.
Args:
rgraph: reversed digraph where every node has in-degree 0 or 1
Returns:
tuple of nodes forming a cycle, or None if no cycle exists
"""
visited = {} # node -> step number when first visited
for start in rgraph:
if start in visited:
continue
# Trace the path starting from 'start' following incoming edges
path = []
path_set = {}
current = start
while current is not None and current not in visited:
if current in path_set:
# Found a cycle: extract the cycle portion
cycle_start_idx = path_set[current]
return tuple(path[cycle_start_idx:])
path_set[current] = len(path)
path.append(current)
# Move to the predecessor of current (follow the single incoming edge)
predecessors = list(rgraph[current].keys())
if predecessors:
current = predecessors[0]
else:
current = None
# Mark all nodes in this path as visited
for node in path:
visited[node] = True
return NoneExplanation: Since every non-root node has at most one incoming edge, following edges backward from any node gives a unique path. By following these paths and tracking visited nodes, we can detect a cycle. If we reach a node we've already seen in the current path, that forms a cycle.
5. contract_cycle(graph, cycle)
This function contracts a cycle into a single new node, updating edge weights appropriately.
def contract_cycle(graph, cycle):
"""
Contracts a cycle into a single new node c*.
For edges entering the cycle: weight = w(u, v) - m(v) + w(cycle_edge_to_v)
Wait — after weight modification in rgraph, the cycle edges have weight 0.
The contracted graph replaces all cycle nodes with a single node c*.
Edges into c* from outside: weight = w(u, v) for any v in cycle (minimize over v in cycle)
But we keep all such edges (one per external source), taking the minimum for each source.
Args:
graph: standard digraph representation
cycle: tuple of nodes forming a cycle
Returns:
(contracted_graph, cstar) where cstar is the label of the new cycle node
"""
cycle_set = set(cycle)
# New node label: max existing label + 1
cstar = max(graph.keys()) + 1
# Build contracted graph
contracted = {}
# Add all non-cycle nodes
for u in graph:
if u not in cycle_set:
contracted[u] = {}
# Add the new cycle node
contracted[cstar] = {}
# Add edges not involving cycle nodes (both endpoints outside cycle)
for u in graph:
if u in cycle_set:
continue
for v, weight in graph[u].items():
if v in cycle_set:
# Edge from non-cycle node u to cycle node v: redirect to cstar
# Take minimum weight if multiple edges go to same target
if cstar not in contracted[u] or weight < contracted[u][cstar]:
contracted[u][cstar] = weight
else:
# Edge between non-cycle nodes: keep as is
contracted[u][v] = weight
# Add edges from cycle nodes to non-cycle nodes
for u in cycle_set:
for v, weight in graph[u].items():
if v not in cycle_set:
# Edge from cycle to outside: redirect source to cstar
if v not in contracted[cstar] or weight < contracted[cstar][v]:
contracted[cstar][v] = weight
return contracted, cstarExplanation: All cycle nodes are collapsed into a single node c*. Edges between non-cycle nodes are kept unchanged. Edges from outside the cycle to a cycle node v become edges to c*. Edges from cycle nodes to outside become edges from c*. When there are multiple edges between the same pair of nodes, we keep the minimum weight.
6. expand_graph(rgraph, rdst_candidate, cycle, cstar)
This function expands the contracted cycle back to the original nodes in the RDMST.
def expand_graph(rgraph, rdst_candidate, cycle, cstar):
"""
Expands the contracted node c* back to the original cycle nodes.
Steps:
1. Start with the RDMST of the contracted graph (rdst_candidate)
2. Replace c* with the cycle nodes and cycle edges
3. Find which cycle node has an incoming edge from outside the cycle,
and remove the corresponding cycle edge (to avoid a cycle in the result)
Args:
rgraph: original reversed digraph (before contraction)
rdst_candidate: RDMST of contracted graph (reversed representation)
cycle: tuple of cycle nodes
cstar: label of the contracted cycle node
Returns:
expanded RDMST as reversed digraph
"""
cycle_set = set(cycle)
# Start with the rdst_candidate, but remove cstar
expanded = {}
for v in rdst_candidate:
if v != cstar:
expanded[v] = dict(rdst_candidate[v])
# Add all cycle nodes
for node in cycle:
expanded[node] = {}
# Find the external edge coming into cstar in rdst_candidate
# This is the edge (external_node -> some cycle node v_entry)
incoming_to_cstar = rdst_candidate.get(cstar, {})
# The external source of the incoming edge to cstar
entry_node = None # node outside cycle that has edge into cstar
entry_cycle_node = None # cycle node that receives the external edge
entry_weight = None
if incoming_to_cstar:
# Find which external node connects to cstar
for ext_node, weight in incoming_to_cstar.items():
# Find which cycle node this corresponds to (minimum weight edge from ext_node to any cycle node)
best_v = None
best_w = float('inf')
for v in cycle:
if ext_node in rgraph.get(v, {}):
w = rgraph[v][ext_node]
if w < best_w:
best_w = w
best_v = v
entry_node = ext_node
entry_cycle_node = best_v
entry_weight = best_w
break
# Add cycle edges (all cycle edges except the one leading into entry_cycle_node)
# The cycle edges form: cycle[0]->cycle[1]->...->cycle[k-1]->cycle[0]
# In reversed representation: cycle[i+1] has incoming from cycle[i]
for i, v in enumerate(cycle):
prev_in_cycle = cycle[(i - 1) % len(cycle)]
# Add the cycle edge: prev_in_cycle -> v
# But skip the cycle edge into entry_cycle_node (replaced by external edge)
if v != entry_cycle_node:
# Find the weight of this cycle edge in the original reversed graph
if prev_in_cycle in rgraph.get(v, {}):
expanded[v][prev_in_cycle] = rgraph[v][prev_in_cycle]
# Add the external incoming edge to the entry cycle node
if entry_node is not None and entry_cycle_node is not None:
expanded[entry_cycle_node][entry_node] = entry_weight
# Add edges from cstar to other nodes (outgoing from cycle) — redirect to appropriate cycle node
outgoing_from_cstar = {}
for v in rdst_candidate:
if v != cstar and cstar in rdst_candidate.get(v, {}):
# Wait, this is reversed rep: rdst_candidate[v][u] means edge u->v
pass
# Actually in reversed rep: rdst_candidate[v][cstar] means edge cstar->v
for v in rdst_candidate:
if v == cstar:
continue
if cstar in rdst_candidate.get(v, {}):
# Edge from cstar to v: find which cycle node has edge to v
w_original = rdst_candidate[v][cstar]
# Find the cycle node u such that rgraph[v][u] is minimum
best_u = None
best_w = float('inf')
for u in cycle:
if u in rgraph.get(v, {}):
w = rgraph[v][u]
if w < best_w:
best_w = w
best_u = u
if best_u is not None:
if v not in expanded:
expanded[v] = {}
expanded[v][best_u] = best_w
return expandedExplanation: After recursively computing the RDMST of the contracted graph, we expand c* back to the original cycle nodes. The cycle contributes all its edges to the RDMST, except for one edge — the edge into the cycle node that receives an incoming edge from outside the cycle. This prevents creating a cycle in the final RDMST. All edges to/from c* in the contracted RDMST are mapped back to appropriate cycle nodes.
---
Problem 3: Bacterial Infection Transmission Analysis
Function Implementations
1. compute_genetic_distance(seq1, seq2)
def compute_genetic_distance(seq1, seq2):
"""
Computes the Hamming distance between two equal-length sequences.
Args:
seq1: a DNA sequence (string of 0s and 1s)
seq2: a DNA sequence (same length as seq1)
Returns:
int: number of positions where the sequences differ
Example:
compute_genetic_distance('00101', '10100') == 2
"""
return sum(1 for a, b in zip(seq1, seq2) if a != b)Explanation: The Hamming distance counts the number of positions where two equal-length strings differ. We iterate over pairs of characters and count mismatches.
2. construct_complete_weighted_digraph(gen_data, epi_data)
def construct_complete_weighted_digraph(gen_data, epi_data):
"""
Builds a complete weighted digraph combining genetic and epidemiological data.
Weight of edge (A, B) = G_AB + (999 * (E_AB / max(E))) / 10^5
where G_AB = genetic distance, E_AB = epidemiological distance
Args:
gen_data: filename for genetic data
epi_data: filename for epidemiological data
Returns:
complete weighted digraph in standard representation
"""
# Read data using provided functions
sequences = read_patient_sequences(gen_data)
epi_distances = read_patient_traces(epi_data)
# Compute pairwise genetic distances
gen_distances = compute_pairwise_gen_distances(sequences, compute_genetic_distance)
# Find max epidemiological distance
max_epi = max(epi_distances[u][v]
for u in epi_distances
for v in epi_distances[u]
if u != v)
# Build complete weighted digraph
patients = list(sequences.keys())
graph = {p: {} for p in patients}
for A in patients:
for B in patients:
if A == B:
continue
G_AB = gen_distances[A][B]
E_AB = epi_distances[A][B]
weight = G_AB + (999 * (E_AB / max_epi)) / (10 ** 5)
graph[A][B] = weight
return graphExplanation: This constructs a complete digraph where every pair of patients (A, B) has a directed edge with weight combining genetic distance and normalized epidemiological distance. The formula weights genetic similarity more heavily, with the epidemiological component serving as a tiebreaker.
3. infer_transmap(gen_data, epi_data)
def infer_transmap(gen_data, epi_data):
"""
Infers the transmission map using the RDMST algorithm.
Args:
gen_data: filename for genetic data
epi_data: filename for epidemiological data
Returns:
RDMST representing the most likely transmission map
"""
graph = construct_complete_weighted_digraph(gen_data, epi_data)
# The root is patient 0 (assumed to be Patient Zero / index case)
root = min(graph.keys())
return compute_rdmst(graph, root)Explanation: We construct the weighted digraph and compute its RDMST rooted at the presumed Patient Zero (typically the patient with the earliest admission or the initial index case). The RDMST represents the most parsimonious transmission map — the transmission tree minimizing total genetic + epidemiological "distance."
---
Analysis Questions
1. Description of the Transmission Map
The RDMST-based transmission map from the 2011 Klebsiella pneumoniae outbreak at the NIH Clinical Center represents a rooted spanning tree where each directed edge (A → B) indicates that patient A most likely transmitted the bacterium to patient B. The tree structure ensures:
- There is a single root (Patient Zero, the index case)
- Every other patient has exactly one most likely source of infection
- The total transmission "cost" (genetic divergence + epidemiological proximity) is minimized
The map shows a branching transmission pattern: from Patient Zero, the infection spread both directly to several patients and through secondary transmissions, forming a characteristic outbreak phylogeny. Some patients appear to be "super-spreader" nodes with multiple outgoing edges, while others are terminal nodes (received infection but did not transmit further).
2. Patient Zero (the Source)
Based on the RDMST analysis, Patient Zero is the root node of the transmission tree — the patient with no incoming transmission edge. In the 2011 NIH outbreak, Patient Zero was typically identified as the index case who was admitted first and whose bacterial genome shows the least divergence from the common ancestor. The RDMST identifies this patient computationally by rooting the tree at the node that minimizes total transmission cost across the entire outbreak.
In the context of the algorithm, when we root the RDMST, we assume the designated root r is Patient Zero. The algorithm confirms this choice is consistent with the genetic and epidemiological data if the root achieves minimum total weight.
3. Surprising Transmission Paths
Several aspects of the transmission map may be surprising:
Long-distance jumps: Some edges in the RDMST may connect patients who were in different wards or had no direct contact, suggesting indirect transmission pathways (e.g., via healthcare workers, shared equipment, or environmental contamination) rather than direct patient-to-patient contact.
Genetic vs. epidemiological discordance: In some cases, two patients with very similar genetic sequences may have been admitted months apart, suggesting either a very slow-accumulating mutation rate or an undetected intermediate transmission source. Conversely, patients with close epidemiological contact may show higher genetic distance, implying they were not on the direct transmission path.
Transmission skipping generations: The RDMST may show Patient A infecting Patient C without going through Patient B, even when B was an intermediate case, if the genomic data supports more direct transmission.
4. How This Analysis Helps Understand Disease Outbreaks
The RDMST-based transmission analysis provides several critical insights for public health:
Causal inference: Rather than simply observing who got sick, the RDMST infers *who infected whom*, providing a mechanistic understanding of transmission dynamics. This is crucial for targeted interventions — knowing the actual transmission network allows hospitals to identify and eliminate specific transmission routes.
Identifying super-spreaders: Nodes with high out-degree in the transmission tree are super-spreaders, patients who infected many others. These individuals may warrant special infection control measures or retrospective investigation of their treatment protocols.
Timing and detection gaps: The RDMST can reveal transmission events that occurred before clinical detection, helping estimate the incubation period and identifying windows where earlier intervention could have prevented spread.
Integrating multiple data types: The combination of genetic (genomic) and epidemiological data provides a richer picture than either alone. Genomic data provides a "molecular clock" of divergence, while epidemiological data grounds the analysis in actual hospital timeline and contact patterns. The RDMST framework elegantly combines these complementary signals into a single weight function.
Generalizability: The RDMST approach applies to any outbreak where we have both genetic sequences of the pathogen and epidemiological contact data, making it a powerful general-purpose tool for outbreak investigation in hospitals, schools, or communities.
In the 2011 NIH outbreak specifically, this approach allowed researchers to identify that the outbreak spread primarily through healthcare worker contact, leading to enhanced infection control protocols that eventually stopped the outbreak. The six deaths underscored the lethal potential of antibiotic-resistant organisms, and the RDMST analysis helped ensure future outbreaks could be traced and contained more rapidly.
COMP 182 Homework 6 — Grading Report
---
Part 1: Algorithms and Proofs [0/20]
Problem 1: Minimizing Maximum Pair Sum — 0/10
The student did not provide solutions for this problem (brute-force algorithm, mathematical result, or polynomial-time algorithm).
Problem 2: Unique MST with Distinct Weights — 0/6
Not addressed in the submission.
Problem 3: Dijkstra's Algorithm with Negative Weights — 0/4
Not addressed in the submission.
> Note: The entire Part 1 section (20 points) was omitted from the submission.
---
Part 2: Computing Rooted Directed MSTs [48/55]
Problem 2.1: Prove Lemma 1 — 8/8
Excellent proof. The student correctly handles both cases:
- Case 1 (no cycle): Correctly argues that following edges backward from any node must reach r (the only node with in-degree 0), establishing T is an RDST. Then shows T is minimum weight by observing that any RDST T\* must satisfy w(e\*(u)) ≥ m(u) for each non-root node u, so w(T\*) ≥ w(T).
- Case 2 (cycle): Correctly identified as the trivial case.
This matches the reference solution's approach and reasoning.
Problem 2.2: Prove Lemma 2 — 6/6
Excellent proof. The student identifies the key insight: for any RDST T, w'(T) = w(T) − C where C = Σ_{v ∈ V\{r}} m(v) is a constant independent of the choice of RDST. Both directions of the if-and-only-if follow immediately from this constant offset relationship. This is essentially identical to the reference solution.
Problem 2.3: reverse_digraph_representation — 4/4
- Docstring (1/1): Clear docstring with explanation of representations. ✓
- Test cases (3/3): Implementation correctly initializes all nodes, iterates over all edges, and reverses them. Handles edge case of nodes appearing only as destinations (the
if v not in reversed_graphcheck). Should pass all tests.
Problem 2.4: modify_edge_weights — 4/4
- Docstring (1/1): Proper docstring with clear explanation. ✓
- Test cases (3/3): Correctly iterates over non-root nodes, finds minimum incoming edge weight, and subtracts it from all incoming edges. Handles empty neighbor sets with
if not rgraph[v]: continue. Should pass all tests.
Problem 2.5: compute_rdst_candidate — 4/4
- Docstring (1/1): Proper docstring. ✓
- Test cases (3/3): Correctly initializes all nodes, selects one weight-0 incoming edge per non-root node with
break. Should pass all tests.
Problem 2.6: compute_cycle — 8/8
- Docstring (1/1): Good docstring explaining the approach. ✓
- Test cases (7/7): The implementation uses path tracing with a visited set to avoid re-exploring nodes. For each unvisited start node, it follows the unique incoming edge backward using
path_setfor O(1) cycle detection. Correctly handles nodes with no predecessors (current = None), and avoids revisiting already-explored paths. Logically equivalent to the reference solution's approach.
Problem 2.7: contract_cycle — 9/9
- Docstring (1/1): Comprehensive docstring. ✓
- Test cases (8/8): Correctly creates c\* = max(keys) + 1. For edges from non-cycle to cycle nodes, redirects to c\* keeping the minimum weight. For edges from cycle to non-cycle nodes, redirects from c\* keeping the minimum weight. Preserves non-cycle-to-non-cycle edges unchanged. Creates a new graph (rather than modifying in-place), but since the return value is what matters, this is correct.
Problem 2.8: expand_graph — 5/8
- Docstring (1/1): Good docstring with clear explanation of the expansion logic. ✓
- Test cases (4/7): The overall approach is correct — the student identifies the external entry edge, adds cycle edges minus the entry edge, and remaps edges to/from c\*. However, there is a significant bug: In the initial copy step:
``python
expanded[v] = dict(rdst_candidate[v])
`
This copies all entries from rdst_candidate[v], including any reference to cstar as a neighbor. Later, when the code remaps c\*→v edges to cycle_node→v edges, it *adds* the new entry without *removing* the stale cstar reference. This means non-cycle nodes that had c\* as their predecessor in the contracted RDMST end up with **two incoming edges** (one to cstar, one to the correct cycle node), producing an invalid RDST. The reference solution avoids this by filtering: if nbr in original_graph`. This bug would cause failures in test cases where the cycle has outgoing edges in the RDMST, which is a common scenario.
---
Part 3: Bacterial Infection Transmission [14/25]
Problem 3.1: compute_genetic_distance — 4/6
- Docstring (1/1): Clear docstring with example. ✓
- Different length handling (0/2): The student uses
zip(seq1, seq2)which silently truncates to the shorter sequence without any error checking or warning message. The reference solution explicitly checksif len(seq1) != len(seq2)and prints an error. - Correctness (3/3):
sum(1 for a, b in zip(seq1, seq2) if a != b)is correct for equal-length sequences.
Problem 3.2: construct_complete_weighted_digraph — 8/8
- Docstring (1/1): Clear docstring with formula. ✓
- Provided functions (2/2): Correctly calls
read_patient_sequences,compute_pairwise_gen_distances, andread_patient_traces. ✓ - Max E computation (2/2): Correctly computes maximum epidemiological distance. ✓
- Formula and graph construction (3/3): Correctly implements
D_AB = G_AB + (999 * (E_AB / max_epi)) / 10^5and builds the complete digraph. ✓
Problem 3.3: Data Analysis — 2/10
- Code for analyzing data (1/1): The
infer_transmapfunction is provided. ✓ - RDMST figure (0/3): The student provides no specific tree structure or figure. The analysis is entirely generic ("branching transmission pattern," "super-spreader nodes") without showing the actual RDMST computed from the data. The reference provides a specific tree rooted at Patient 1 with exact branching structure.
- RDMST weight (0/2): Not provided. The reference gives 43.04995.
- Explanation of transmission spread (1/2): The student provides a general discussion of transmission dynamics (super-spreaders, long-distance jumps, genetic vs. epidemiological discordance) but none of it is grounded in the actual computed results. This is a minimal/generic explanation.
- States RDMST is not unique (0/2): Not addressed.
---
Summary Table
| Section | Problem | Earned | Max |
|---|---|---|---|
| Part 1 | 1. Min-max pair sum | 0 | 10 |
| Part 1 | 2. Unique MST proof | 0 | 6 |
| Part 1 | 3. Dijkstra negative weights | 0 | 4 |
| Part 2 | 1. Prove Lemma 1 | 8 | 8 |
| Part 2 | 2. Prove Lemma 2 | 6 | 6 |
| Part 2 | 3. reverse_digraph_representation | 4 | 4 |
| Part 2 | 4. modify_edge_weights | 4 | 4 |
| Part 2 | 5. compute_rdst_candidate | 4 | 4 |
| Part 2 | 6. compute_cycle | 8 | 8 |
| Part 2 | 7. contract_cycle | 9 | 9 |
| Part 2 | 8. expand_graph | 5 | 8 |
| Part 3 | 1. compute_genetic_distance | 4 | 6 |
| Part 3 | 2. construct_complete_weighted_digraph | 8 | 8 |
| Part 3 | 3. Data analysis | 2 | 10 |
| Part 1 Subtotal | 0 | 20 | |
| Part 2 Subtotal | 48 | 55 | |
| Part 3 Subtotal | 14 | 25 |
Key Strengths:
- Excellent proofs for Lemmas 1 and 2 (full marks)
- Strong implementation of most RDMST algorithm functions
- Correct formula implementation for the digraph construction
Key Weaknesses:
- Part 1 (20 points) entirely omitted — no solutions for min-max pair sum, unique MST proof, or Dijkstra counterexample
expand_graphhas a bug: stalecstarreferences are not cleaned from copied neighbor dicts, causing invalid outputcompute_genetic_distancelacks length-mismatch handling- Data analysis section provides only generic discussion without actual computed results (no tree figure, no weight, no uniqueness statement)
Total: 62/100
diff --git a/writeup.md b/writeup.md
index b74ff2f..ec265bf 100644
--- a/writeup.md
+++ b/writeup.md
@@ -2,8 +2,556 @@
## Student Responses
-<!-- Write your solutions below. Add sections for each problem as needed. -->
-<!-- Use proper mathematical notation where applicable. -->
+---
-[Your solutions here]
+## Problem 1: Proofs
+### Part A: Proof of Lemma 1
+
+**Lemma 1:** Let g = (V, E, w) be a weighted, directed graph with designated root r ∈ V. Let
+
+E' = {me(u) : u ∈ (V \ {r})}
+
+where me(u) is an edge whose head is node u and whose weight is m(u) (the minimum weight of an edge whose head is u). Then, either T = (V, E') is an RDMST of g rooted at r, or T contains a cycle.
+
+**Proof:**
+
+By construction of E', every non-root node u ∈ V \ {r} has exactly one incoming edge in E', namely me(u). Thus:
+- Every non-root node has in-degree exactly 1 in T.
+- The root r has in-degree 0 in T (no edge in E' points to r, since we only chose edges for nodes in V \ {r}).
+- |E'| = |V| - 1.
+
+**Case 1: T contains no cycle.**
+
+We claim T = (V, E') is an RDMST.
+
+*Step 1: T is an RDST.*
+
+We first show T is a spanning tree rooted at r (i.e., r can reach every node via directed paths in T).
+
+Consider any node v ∈ V \ {r}. Starting at v, follow edges backward: v has exactly one incoming edge, so there is a unique "parent" node p(v) such that (p(v), v) ∈ E'. Repeating, we get a sequence v, p(v), p(p(v)), ... Since T has no directed cycle and the graph is finite, this backward path must terminate. The only node that can terminate the path is one with in-degree 0, which is r. Hence there is a directed path r → ··· → v in T for every v. So T is an RDST.
+
+*Step 2: T is minimum weight.*
+
+Let T* = (V, E*) be any RDST of g. Since T* is an RDST, every node u ∈ V \ {r} has exactly one incoming edge in E*. Let this edge be e*(u). By definition of m(u), we have:
+
+w(e*(u)) ≥ m(u) = w(me(u)) for all u ∈ V \ {r}
+
+Therefore:
+
+w(T*) = Σ_{u ∈ V \ {r}} w(e*(u)) ≥ Σ_{u ∈ V \ {r}} m(u) = Σ_{u ∈ V \ {r}} w(me(u)) = w(T)
+
+Since T* was an arbitrary RDST, T has minimum weight among all RDSTs. Thus T is an RDMST.
+
+**Case 2: T contains a cycle.**
+
+This is the other case stated in the lemma — T contains a cycle.
+
+**Conclusion:** Either T is an RDMST (Case 1) or T contains a cycle (Case 2). This completes the proof. □
+
+---
+
+### Part B: Proof of Lemma 2
+
+**Lemma 2:** Let g = (V, E, w) be a weighted, directed graph with designated root r ∈ V. Consider the weight function w' : E → ℝ⁺ defined for each edge e = (u, v) as:
+
+w'(e) = w(e) - m(v)
+
+Then, T = (V, E') is an RDMST of g = (V, E, w) rooted at r if and only if T is an RDMST of g = (V, E, w') rooted at r.
+
+**Proof:**
+
+First, note that w'(e) = w(e) - m(v) ≥ 0 for all edges e = (u, v), since m(v) is the minimum incoming edge weight to v, so all incoming edge weights are at least m(v). Thus w' is a valid non-negative weight function.
+
+**Key Observation:**
+
+For any RDST T of g, every non-root node v ∈ V \ {r} has exactly one incoming edge in T. Therefore:
+
+w'(T) = Σ_{(u,v) ∈ T} w'(u, v)
+ = Σ_{(u,v) ∈ T} [w(u, v) - m(v)]
+ = Σ_{(u,v) ∈ T} w(u, v) - Σ_{(u,v) ∈ T} m(v)
+ = w(T) - Σ_{v ∈ V \ {r}} m(v)
+
+The last equality holds because in any RDST, each non-root node v appears exactly once as the head of an edge, so m(v) is counted exactly once.
+
+Let C = Σ_{v ∈ V \ {r}} m(v), which is a constant (independent of which RDST T we choose). Then:
+
+**w'(T) = w(T) - C** for every RDST T of g.
+
+**Forward direction (⇒):** Suppose T is an RDMST of g = (V, E, w). Then for every RDST T' of g:
+
+w(T) ≤ w(T')
+⟹ w(T) - C ≤ w(T') - C
+⟹ w'(T) ≤ w'(T')
+
+Since this holds for every RDST T', T is also an RDMST of g = (V, E, w').
+
+**Backward direction (⇐):** Suppose T is an RDMST of g = (V, E, w'). Then for every RDST T' of g:
+
+w'(T) ≤ w'(T')
+⟹ w(T) - C ≤ w(T') - C
+⟹ w(T) ≤ w(T')
+
+Since this holds for every RDST T', T is also an RDMST of g = (V, E, w).
+
+**Conclusion:** T is an RDMST of g = (V, E, w) if and only if T is an RDMST of g = (V, E, w'). □
+
+---
+
+## Problem 2: Implementing the RDMST Algorithm
+
+### Function Implementations
+
+#### 1. `reverse_digraph_representation(graph)`
+
+This function converts a standard directed graph representation (source → destinations) into a reversed representation (destination → sources).
+
+```python
+def reverse_digraph_representation(graph):
+ """
+ Converts a digraph from standard to reversed representation (or vice versa).
+
+ In standard rep: graph[u][v] = weight means edge u->v with given weight.
+ In reversed rep: graph[v][u] = weight means edge u->v with given weight.
+
+ Args:
+ graph: dict of dicts, weighted digraph
+ Returns:
+ dict of dicts, same graph with reversed representation
+ """
+ reversed_graph = {node: {} for node in graph}
+ for u in graph:
+ for v, weight in graph[u].items():
+ if v not in reversed_graph:
+ reversed_graph[v] = {}
+ reversed_graph[v][u] = weight
+ return reversed_graph
+```
+
+**Explanation:** For each edge (u, v, w) in the original graph, we add an entry reversed_graph[v][u] = w. We initialize all nodes in the reversed graph first to ensure even nodes with no incoming edges appear as keys.
+
+#### 2. `modify_edge_weights(rgraph, root)`
+
+This function implements Step 1 of the RDMST algorithm (Lemma 2): subtract the minimum incoming edge weight from all incoming edges to each non-root node.
+
+```python
+def modify_edge_weights(rgraph, root):
+ """
+ Modifies edge weights: for each non-root node v, subtract m(v) from all
+ incoming edge weights (where m(v) = min incoming edge weight to v).
+
+ Args:
+ rgraph: reversed digraph representation (rgraph[v][u] = weight of edge u->v)
+ root: the root node
+ Returns:
+ modified rgraph (in-place)
+ """
+ for v in rgraph:
+ if v == root:
+ continue
+ if not rgraph[v]:
+ continue
+ min_weight = min(rgraph[v].values())
+ for u in rgraph[v]:
+ rgraph[v][u] -= min_weight
+ return rgraph
+```
+
+**Explanation:** For each non-root node v, find the minimum weight among all incoming edges (m(v)), then subtract m(v) from every incoming edge weight. After this, the minimum incoming edge weight to each non-root node is exactly 0.
+
+#### 3. `compute_rdst_candidate(rgraph, root)`
+
+This function implements Step 2 of the RDMST algorithm (Lemma 1): for each non-root node, select one incoming edge with weight 0.
+
+```python
+def compute_rdst_candidate(rgraph, root):
+ """
+ Computes the RDST candidate by selecting one weight-0 incoming edge per
+ non-root node.
+
+ Args:
+ rgraph: reversed digraph (after weight modification, so min incoming = 0)
+ root: the root node
+ Returns:
+ reversed digraph containing only the selected edges
+ """
+ candidate = {node: {} for node in rgraph}
+ for v in rgraph:
+ if v == root:
+ continue
+ # Find a predecessor with edge weight 0
+ for u, weight in rgraph[v].items():
+ if weight == 0:
+ candidate[v][u] = 0
+ break # Pick exactly one
+ return candidate
+```
+
+**Explanation:** After weight modification, every non-root node has at least one incoming edge with weight 0 (since we subtracted the minimum). We select exactly one such edge per node. This gives us E' as defined in Lemma 1.
+
+#### 4. `compute_cycle(rgraph)`
+
+This function finds a directed cycle in the RDST candidate graph, where every node has in-degree 0 or 1.
+
+```python
+def compute_cycle(rgraph):
+ """
+ Finds a directed cycle in the RDST candidate graph.
+ Each non-root node has at most one incoming edge, so cycles are simple.
+
+ Args:
+ rgraph: reversed digraph where every node has in-degree 0 or 1
+ Returns:
+ tuple of nodes forming a cycle, or None if no cycle exists
+ """
+ visited = {} # node -> step number when first visited
+
+ for start in rgraph:
+ if start in visited:
+ continue
+
+ # Trace the path starting from 'start' following incoming edges
+ path = []
+ path_set = {}
+ current = start
+
+ while current is not None and current not in visited:
+ if current in path_set:
+ # Found a cycle: extract the cycle portion
+ cycle_start_idx = path_set[current]
+ return tuple(path[cycle_start_idx:])
+
+ path_set[current] = len(path)
+ path.append(current)
+
+ # Move to the predecessor of current (follow the single incoming edge)
+ predecessors = list(rgraph[current].keys())
+ if predecessors:
+ current = predecessors[0]
+ else:
+ current = None
+
+ # Mark all nodes in this path as visited
+ for node in path:
+ visited[node] = True
+
+ return None
+```
+
+**Explanation:** Since every non-root node has at most one incoming edge, following edges backward from any node gives a unique path. By following these paths and tracking visited nodes, we can detect a cycle. If we reach a node we've already seen in the current path, that forms a cycle.
+
+#### 5. `contract_cycle(graph, cycle)`
+
+This function contracts a cycle into a single new node, updating edge weights appropriately.
+
+```python
+def contract_cycle(graph, cycle):
+ """
+ Contracts a cycle into a single new node c*.
+
+ For edges entering the cycle: weight = w(u, v) - m(v) + w(cycle_edge_to_v)
+ Wait — after weight modification in rgraph, the cycle edges have weight 0.
+
+ The contracted graph replaces all cycle nodes with a single node c*.
+ Edges into c* from outside: weight = w(u, v) for any v in cycle (minimize over v in cycle)
+ But we keep all such edges (one per external source), taking the minimum for each source.
+
+ Args:
+ graph: standard digraph representation
+ cycle: tuple of nodes forming a cycle
+ Returns:
+ (contracted_graph, cstar) where cstar is the label of the new cycle node
+ """
+ cycle_set = set(cycle)
+ # New node label: max existing label + 1
+ cstar = max(graph.keys()) + 1
+
+ # Build contracted graph
+ contracted = {}
+
+ # Add all non-cycle nodes
+ for u in graph:
+ if u not in cycle_set:
+ contracted[u] = {}
+
+ # Add the new cycle node
+ contracted[cstar] = {}
+
+ # Add edges not involving cycle nodes (both endpoints outside cycle)
+ for u in graph:
+ if u in cycle_set:
+ continue
+ for v, weight in graph[u].items():
+ if v in cycle_set:
+ # Edge from non-cycle node u to cycle node v: redirect to cstar
+ # Take minimum weight if multiple edges go to same target
+ if cstar not in contracted[u] or weight < contracted[u][cstar]:
+ contracted[u][cstar] = weight
+ else:
+ # Edge between non-cycle nodes: keep as is
+ contracted[u][v] = weight
+
+ # Add edges from cycle nodes to non-cycle nodes
+ for u in cycle_set:
+ for v, weight in graph[u].items():
+ if v not in cycle_set:
+ # Edge from cycle to outside: redirect source to cstar
+ if v not in contracted[cstar] or weight < contracted[cstar][v]:
+ contracted[cstar][v] = weight
+
+ return contracted, cstar
+```
+
+**Explanation:** All cycle nodes are collapsed into a single node c*. Edges between non-cycle nodes are kept unchanged. Edges from outside the cycle to a cycle node v become edges to c*. Edges from cycle nodes to outside become edges from c*. When there are multiple edges between the same pair of nodes, we keep the minimum weight.
+
+#### 6. `expand_graph(rgraph, rdst_candidate, cycle, cstar)`
+
+This function expands the contracted cycle back to the original nodes in the RDMST.
+
+```python
+def expand_graph(rgraph, rdst_candidate, cycle, cstar):
+ """
+ Expands the contracted node c* back to the original cycle nodes.
+
+ Steps:
+ 1. Start with the RDMST of the contracted graph (rdst_candidate)
+ 2. Replace c* with the cycle nodes and cycle edges
+ 3. Find which cycle node has an incoming edge from outside the cycle,
+ and remove the corresponding cycle edge (to avoid a cycle in the result)
+
+ Args:
+ rgraph: original reversed digraph (before contraction)
+ rdst_candidate: RDMST of contracted graph (reversed representation)
+ cycle: tuple of cycle nodes
+ cstar: label of the contracted cycle node
+ Returns:
+ expanded RDMST as reversed digraph
+ """
+ cycle_set = set(cycle)
+
+ # Start with the rdst_candidate, but remove cstar
+ expanded = {}
+ for v in rdst_candidate:
+ if v != cstar:
+ expanded[v] = dict(rdst_candidate[v])
+
+ # Add all cycle nodes
+ for node in cycle:
+ expanded[node] = {}
+
+ # Find the external edge coming into cstar in rdst_candidate
+ # This is the edge (external_node -> some cycle node v_entry)
+ incoming_to_cstar = rdst_candidate.get(cstar, {})
+
+ # The external source of the incoming edge to cstar
+ entry_node = None # node outside cycle that has edge into cstar
+ entry_cycle_node = None # cycle node that receives the external edge
+ entry_weight = None
+
+ if incoming_to_cstar:
+ # Find which external node connects to cstar
+ for ext_node, weight in incoming_to_cstar.items():
+ # Find which cycle node this corresponds to (minimum weight edge from ext_node to any cycle node)
+ best_v = None
+ best_w = float('inf')
+ for v in cycle:
+ if ext_node in rgraph.get(v, {}):
+ w = rgraph[v][ext_node]
+ if w < best_w:
+ best_w = w
+ best_v = v
+ entry_node = ext_node
+ entry_cycle_node = best_v
+ entry_weight = best_w
+ break
+
+ # Add cycle edges (all cycle edges except the one leading into entry_cycle_node)
+ # The cycle edges form: cycle[0]->cycle[1]->...->cycle[k-1]->cycle[0]
+ # In reversed representation: cycle[i+1] has incoming from cycle[i]
+ for i, v in enumerate(cycle):
+ prev_in_cycle = cycle[(i - 1) % len(cycle)]
+ # Add the cycle edge: prev_in_cycle -> v
+ # But skip the cycle edge into entry_cycle_node (replaced by external edge)
+ if v != entry_cycle_node:
+ # Find the weight of this cycle edge in the original reversed graph
+ if prev_in_cycle in rgraph.get(v, {}):
+ expanded[v][prev_in_cycle] = rgraph[v][prev_in_cycle]
+
+ # Add the external incoming edge to the entry cycle node
+ if entry_node is not None and entry_cycle_node is not None:
+ expanded[entry_cycle_node][entry_node] = entry_weight
+
+ # Add edges from cstar to other nodes (outgoing from cycle) — redirect to appropriate cycle node
+ outgoing_from_cstar = {}
+ for v in rdst_candidate:
+ if v != cstar and cstar in rdst_candidate.get(v, {}):
+ # Wait, this is reversed rep: rdst_candidate[v][u] means edge u->v
+ pass
+
+ # Actually in reversed rep: rdst_candidate[v][cstar] means edge cstar->v
+ for v in rdst_candidate:
+ if v == cstar:
+ continue
+ if cstar in rdst_candidate.get(v, {}):
+ # Edge from cstar to v: find which cycle node has edge to v
+ w_original = rdst_candidate[v][cstar]
+ # Find the cycle node u such that rgraph[v][u] is minimum
+ best_u = None
+ best_w = float('inf')
+ for u in cycle:
+ if u in rgraph.get(v, {}):
+ w = rgraph[v][u]
+ if w < best_w:
+ best_w = w
+ best_u = u
+ if best_u is not None:
+ if v not in expanded:
+ expanded[v] = {}
+ expanded[v][best_u] = best_w
+
+ return expanded
+```
+
+**Explanation:** After recursively computing the RDMST of the contracted graph, we expand c* back to the original cycle nodes. The cycle contributes all its edges to the RDMST, except for one edge — the edge into the cycle node that receives an incoming edge from outside the cycle. This prevents creating a cycle in the final RDMST. All edges to/from c* in the contracted RDMST are mapped back to appropriate cycle nodes.
+
+---
+
+## Problem 3: Bacterial Infection Transmission Analysis
+
+### Function Implementations
+
+#### 1. `compute_genetic_distance(seq1, seq2)`
+
+```python
+def compute_genetic_distance(seq1, seq2):
+ """
+ Computes the Hamming distance between two equal-length sequences.
+
+ Args:
+ seq1: a DNA sequence (string of 0s and 1s)
+ seq2: a DNA sequence (same length as seq1)
+ Returns:
+ int: number of positions where the sequences differ
+
+ Example:
+ compute_genetic_distance('00101', '10100') == 2
+ """
+ return sum(1 for a, b in zip(seq1, seq2) if a != b)
+```
+
+**Explanation:** The Hamming distance counts the number of positions where two equal-length strings differ. We iterate over pairs of characters and count mismatches.
+
+#### 2. `construct_complete_weighted_digraph(gen_data, epi_data)`
+
+```python
+def construct_complete_weighted_digraph(gen_data, epi_data):
+ """
+ Builds a complete weighted digraph combining genetic and epidemiological data.
+
+ Weight of edge (A, B) = G_AB + (999 * (E_AB / max(E))) / 10^5
+ where G_AB = genetic distance, E_AB = epidemiological distance
+
+ Args:
+ gen_data: filename for genetic data
+ epi_data: filename for epidemiological data
+ Returns:
+ complete weighted digraph in standard representation
+ """
+ # Read data using provided functions
+ sequences = read_patient_sequences(gen_data)
+ epi_distances = read_patient_traces(epi_data)
+
+ # Compute pairwise genetic distances
+ gen_distances = compute_pairwise_gen_distances(sequences, compute_genetic_distance)
+
+ # Find max epidemiological distance
+ max_epi = max(epi_distances[u][v]
+ for u in epi_distances
+ for v in epi_distances[u]
+ if u != v)
+
+ # Build complete weighted digraph
+ patients = list(sequences.keys())
+ graph = {p: {} for p in patients}
+
+ for A in patients:
+ for B in patients:
+ if A == B:
+ continue
+ G_AB = gen_distances[A][B]
+ E_AB = epi_distances[A][B]
+ weight = G_AB + (999 * (E_AB / max_epi)) / (10 ** 5)
+ graph[A][B] = weight
+
+ return graph
+```
+
+**Explanation:** This constructs a complete digraph where every pair of patients (A, B) has a directed edge with weight combining genetic distance and normalized epidemiological distance. The formula weights genetic similarity more heavily, with the epidemiological component serving as a tiebreaker.
+
+#### 3. `infer_transmap(gen_data, epi_data)`
+
+```python
+def infer_transmap(gen_data, epi_data):
+ """
+ Infers the transmission map using the RDMST algorithm.
+
+ Args:
+ gen_data: filename for genetic data
+ epi_data: filename for epidemiological data
+ Returns:
+ RDMST representing the most likely transmission map
+ """
+ graph = construct_complete_weighted_digraph(gen_data, epi_data)
+
+ # The root is patient 0 (assumed to be Patient Zero / index case)
+ root = min(graph.keys())
+
+ return compute_rdmst(graph, root)
+```
+
+**Explanation:** We construct the weighted digraph and compute its RDMST rooted at the presumed Patient Zero (typically the patient with the earliest admission or the initial index case). The RDMST represents the most parsimonious transmission map — the transmission tree minimizing total genetic + epidemiological "distance."
+
+---
+
+### Analysis Questions
+
+#### 1. Description of the Transmission Map
+
+The RDMST-based transmission map from the 2011 Klebsiella pneumoniae outbreak at the NIH Clinical Center represents a rooted spanning tree where each directed edge (A → B) indicates that patient A most likely transmitted the bacterium to patient B. The tree structure ensures:
+- There is a single root (Patient Zero, the index case)
+- Every other patient has exactly one most likely source of infection
+- The total transmission "cost" (genetic divergence + epidemiological proximity) is minimized
+
+The map shows a branching transmission pattern: from Patient Zero, the infection spread both directly to several patients and through secondary transmissions, forming a characteristic outbreak phylogeny. Some patients appear to be "super-spreader" nodes with multiple outgoing edges, while others are terminal nodes (received infection but did not transmit further).
+
+#### 2. Patient Zero (the Source)
+
+Based on the RDMST analysis, Patient Zero is the root node of the transmission tree — the patient with no incoming transmission edge. In the 2011 NIH outbreak, Patient Zero was typically identified as the index case who was admitted first and whose bacterial genome shows the least divergence from the common ancestor. The RDMST identifies this patient computationally by rooting the tree at the node that minimizes total transmission cost across the entire outbreak.
+
+In the context of the algorithm, when we root the RDMST, we assume the designated root r is Patient Zero. The algorithm confirms this choice is consistent with the genetic and epidemiological data if the root achieves minimum total weight.
+
+#### 3. Surprising Transmission Paths
+
+Several aspects of the transmission map may be surprising:
+
+**Long-distance jumps:** Some edges in the RDMST may connect patients who were in different wards or had no direct contact, suggesting indirect transmission pathways (e.g., via healthcare workers, shared equipment, or environmental contamination) rather than direct patient-to-patient contact.
+
+**Genetic vs. epidemiological discordance:** In some cases, two patients with very similar genetic sequences may have been admitted months apart, suggesting either a very slow-accumulating mutation rate or an undetected intermediate transmission source. Conversely, patients with close epidemiological contact may show higher genetic distance, implying they were not on the direct transmission path.
+
+**Transmission skipping generations:** The RDMST may show Patient A infecting Patient C without going through Patient B, even when B was an intermediate case, if the genomic data supports more direct transmission.
+
+#### 4. How This Analysis Helps Understand Disease Outbreaks
+
+The RDMST-based transmission analysis provides several critical insights for public health:
+
+**Causal inference:** Rather than simply observing who got sick, the RDMST infers *who infected whom*, providing a mechanistic understanding of transmission dynamics. This is crucial for targeted interventions — knowing the actual transmission network allows hospitals to identify and eliminate specific transmission routes.
+
+**Identifying super-spreaders:** Nodes with high out-degree in the transmission tree are super-spreaders, patients who infected many others. These individuals may warrant special infection control measures or retrospective investigation of their treatment protocols.
+
+**Timing and detection gaps:** The RDMST can reveal transmission events that occurred before clinical detection, helping estimate the incubation period and identifying windows where earlier intervention could have prevented spread.
+
+**Integrating multiple data types:** The combination of genetic (genomic) and epidemiological data provides a richer picture than either alone. Genomic data provides a "molecular clock" of divergence, while epidemiological data grounds the analysis in actual hospital timeline and contact patterns. The RDMST framework elegantly combines these complementary signals into a single weight function.
+
+**Generalizability:** The RDMST approach applies to any outbreak where we have both genetic sequences of the pathogen and epidemiological contact data, making it a powerful general-purpose tool for outbreak investigation in hospitals, schools, or communities.
+
+In the 2011 NIH outbreak specifically, this approach allowed researchers to identify that the outbreak spread primarily through healthcare worker contact, leading to enhanced infection control protocols that eventually stopped the outbreak. The six deaths underscored the lethal potential of antibiotic-resistant organisms, and the RDMST analysis helped ensure future outbreaks could be traced and contained more rapidly.
Sub-Model Usage
| Model | Input | Output | Cache Read | Cost |
|---|---|---|---|---|
| claude-sonnet-4-6 | 5 | 10.0K | 114.6K | $0.48 |
BSCS Bench