Agent Work: Pairwise Sequence Alignment
Claude Opus 4.6 · COMP 322: Fundamentals of Parallel Programming
COMP 322 Homework 3: Pairwise Sequence Alignment
Total: 100 points
Overview
Implement parallel algorithms for optimal scoring of pairwise DNA sequence alignment using the Smith-Waterman algorithm.
Testing Your Solution
# Run all tests
bin/grade workspaces/agent_hw3
# Or run specific test classes via Maven
mvn test -Dtest=Homework3Checkpoint1CorrectnessTest
mvn test -Dtest=Homework3Checkpoint2CorrectnessTest
mvn test -Dtest=Homework3Checkpoint3CorrectnessTest
mvn test -Dtest=Homework3PerformanceTest
---
Part 1: Written Assignment (30 points)
1.1 Amdahl's Law (15 points)
Consider a generalization of Amdahl's Law:
- q1 = fraction of WORK that must be sequential
- q2 = fraction that can use at most 2 processors
- (1 - q1 - q2) = fraction that can use unbounded processors
1. (10 points) Provide the best possible upper bound on Speedup as a function of q1 and q2.
2. (5 points) Justify your answer using the cases q1=0, q2=0, q1=1, or q2=1.
1.2 Finish Accumulators (15 points)
Analyze the Parallel Search algorithm that computes Q (occurrences of pattern in text):
count0 = 0;
accumulator a = new accumulator(SUM, int.class);
finish (a) {
for (int i = 0; i <= N - M; i++)
async {
int j;
for (j = 0; j < M; j++) if (text[i+j] != pattern[j]) break;
if (j == M) { count0++; a.put(1); }
count1 = a.get();
}
}
count2 = a.get();

What possible values can count0, count1, and count2 contain? Write your answers in terms of M, N, and Q.
---
Part 2: Programming Assignment (70 points)
Background: Smith-Waterman Algorithm
For DNA sequences (alphabet: A, C, T, G), the Smith-Waterman algorithm computes optimal alignment scores.
Scoring scheme:
- Match (p=q): +5
- Mismatch (p!=q): +2
- Gap in Y (p,-): -4
- Gap in X (-,q): -2
Recurrence:
S[i,j] = max{
S[i-1, j-1] + M[X[i],Y[j]], // diagonal
S[i-1, j] + M[X[i],-], // gap in Y
S[i, j-1] + M[-,Y[j]] // gap in X
}

Boundary conditions: S[i,0] = i * M[p,-] and S[0,j] = j * M[-,q]
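As a concrete reference, the recurrence and boundary conditions can be sketched sequentially in plain Java (a sketch using the scoring constants above: match +5, mismatch +2, gap scores -4 and -2; this is not the HJlib solution to submit):

```java
// Sequential reference for the alignment recurrence above (sketch).
// Scores: match +5, mismatch +2, X char vs gap -4, gap vs Y char -2.
public class SeqScoring {
    public static int score(String x, String y) {
        int n = x.length(), m = y.length();
        int[][] s = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) s[i][0] = -4 * i;   // S[i,0] = i * M[p,-]
        for (int j = 1; j <= m; j++) s[0][j] = -2 * j;   // S[0,j] = j * M[-,q]
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = (x.charAt(i - 1) == y.charAt(j - 1)) ? 5 : 2;
                s[i][j] = Math.max(s[i - 1][j - 1] + sub,   // diagonal
                          Math.max(s[i - 1][j] - 4,         // gap in Y
                                   s[i][j - 1] - 2));       // gap in X
            }
        }
        return s[n][m];
    }
}
```

This is the baseline the parallel versions must match cell for cell.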
Checkpoint 1: IdealParScoring.java (15 points)
Maximize ideal parallelism using HJlib abstract metrics:
- Insert doWork(1) for each S[i,j] computation
- Do NOT use phasers
- Create tasks to exploit the diagonal dependence structure
Checkpoint 2: UsefulParScoring.java (20 points)
Achieve smallest execution time on 16 cores:
- Target: >= 8x speedup over sequential
- Test with O(10^4) length strings
- Can use phasers for synchronization
Final: SparseParScoring.java (25 points)
Sparse memory version:
- Use less than O(n^2) space
- Handle O(10^5) length strings
- Target: >= 3x speedup
---
Key Insight
The dependence structure of S[i,j] computations forms a wavefront pattern:
- Each cell depends on its left, top, and diagonal neighbors
- Cells on the same anti-diagonal can be computed in parallel
- This is the key to parallelization
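A minimal plain-Java sketch of this wavefront, using a parallel stream over each anti-diagonal in place of HJlib finish/async (a sketch under the same scoring constants, not the required HJlib implementation):

```java
import java.util.stream.IntStream;

// Wavefront sketch: cells with i + j == d are mutually independent, so each
// anti-diagonal is filled with a parallel stream; the terminal forEach joins
// before the next diagonal starts, preserving the dependence order.
public class WavefrontScoring {
    public static int score(String x, String y) {
        int n = x.length(), m = y.length();
        int[][] s = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) s[i][0] = -4 * i;
        for (int j = 1; j <= m; j++) s[0][j] = -2 * j;
        for (int d = 2; d <= n + m; d++) {             // one step per anti-diagonal
            final int dd = d;
            IntStream.rangeClosed(Math.max(1, dd - m), Math.min(n, dd - 1))
                     .parallel()
                     .forEach(i -> {
                         int j = dd - i;
                         int sub = (x.charAt(i - 1) == y.charAt(j - 1)) ? 5 : 2;
                         s[i][j] = Math.max(s[i - 1][j - 1] + sub,
                                   Math.max(s[i - 1][j] - 4, s[i][j - 1] - 2));
                     });
        }
        return s[n][m];
    }
}
```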
---
Scoring
- 15 points: Checkpoint 1 (IdealParScoring)
- 20 points: Checkpoint 2 (UsefulParScoring)
- 25 points: Final (SparseParScoring)
- 10 points: Report with design, correctness justification, and performance measurements
---
Files to Implement
- src/main/java/edu/rice/comp322/IdealParScoring.java
- src/main/java/edu/rice/comp322/UsefulParScoring.java
- src/main/java/edu/rice/comp322/SparseParScoring.java
--- *COMP 322: Fundamentals of Parallel Programming, Rice University*
package edu.rice.comp322;
import edu.rice.hj.api.SuspendableException;
/**
* Any method of scoring the sequence alignment of two DNA strings.
*
* @author Derek
*/
public abstract class AbstractDnaScoring {
// scoring matrix
/*
_ A C G T
------------------
_ |-8 -2 -2 -2 -2
A |-4 5 2 2 2
C |-4 2 5 2 2
G |-4 2 2 5 2
T |-4 2 2 2 5
*/
protected static final int[][] M = new int[][]{{-8, -2, -2, -2, -2},
{-4, 5, 2, 2, 2},
{-4, 2, 5, 2, 2},
{-4, 2, 2, 5, 2},
{-4, 2, 2, 2, 5}};
/**
* Helper method for getting the index of character in scoring matrix.
*
* @param inputChar The character to search for.
* @return The index of the requested character in the matrix.
*/
protected int charMap(final char inputChar) {
switch (inputChar) {
case '_':
return 0;
case 'A':
case 'a':
return 1;
case 'C':
case 'c':
return 2;
case 'G':
case 'g':
return 3;
case 'T':
case 't':
return 4;
default:
throw new IllegalArgumentException("Invalid DNA character: " + inputChar);
}
}
/**
* Get score assigned for alignment of two letters.
*
* @param iVal the first index in the scoring matrix
* @param jVal the second index in the scoring matrix
* @return the corresponding value in the matrix
*/
protected int getScore(final int iVal, final int jVal) {
return M[iVal][jVal];
}
/**
* Helper method for printing out a two dimensional int array, can be useful for debugging.
*
* @param matrix The desired int matrix to be printed.
*/
public void printMatrix(final int[][] matrix) {
for (int i = 0; i < matrix.length; i++) {
for (int j = 0; j < matrix[0].length; j++) {
System.out.print(matrix[i][j] + " ");
}
System.out.println();
}
System.out.println("--------------------------------");
}
/**
* Compute the solution matrix and return the alignment score of two DNA sequences. This one is done sequentially.
*
* @param x The first sequence to be compared.
* @param y The second sequence to be compared.
* @return The alignment score between them
*/
public abstract int scoreSequences(String x, String y) throws SuspendableException;
}
package edu.rice.comp322;
import edu.rice.hj.api.SuspendableException;
import static edu.rice.hj.Module0.finish;
import static edu.rice.hj.Module1.async;
import static edu.rice.hj.Module1.doWork;
/**
* A scorer that works in parallel.
*/
public class IdealParScoring extends AbstractDnaScoring {
private final int xLength;
private final int yLength;
private final int[][] s;
/**
* <p>main.</p> Takes the names of two files, and in parallel calculates the sequence alignment scores of the two DNA
* strands that they represent.
*
* @param args The names of two files.
*/
public static void main(final String[] args) throws Exception {
final ScoringRunner scoringRunner = new ScoringRunner(IdealParScoring::new);
scoringRunner.start("IdealParScoring", args);
}
/**
* Creates a new IdealParScoring.
*
* @param xLength length of the first sequence
* @param yLength length of the second sequence
*/
public IdealParScoring(final int xLength, final int yLength) {
if (xLength <= 0 || yLength <= 0) {
throw new IllegalArgumentException(
"Lengths (" + xLength + ", " + yLength + ") must be positive!");
}
this.xLength = xLength;
this.yLength = yLength;
s = new int[xLength + 1][yLength + 1];
s[0][0] = 0;
for (int ii = 1; ii <= xLength; ii++) {
s[ii][0] = getScore(1, 0) * ii;
}
for (int jj = 1; jj <= yLength; jj++) {
s[0][jj] = getScore(0, 1) * jj;
}
}
/**
* This method should be filled in with a parallel implementation of the Smith-Waterman alignment algorithm that
* maximizes ideal parallelism.
* {@inheritDoc}
*/
public int scoreSequences(final String x, final String y) throws SuspendableException {
// Process anti-diagonals: cells on the same anti-diagonal are independent.
// Anti-diagonal d: cells (i,j) where i+j = d.
// CPL = xLength + yLength - 1 (number of anti-diagonals).
// Work = xLength * yLength (one doWork(1) per cell).
for (int d = 2; d <= xLength + yLength; d++) {
final int dd = d;
finish(() -> {
final int iMin = Math.max(1, dd - yLength);
final int iMax = Math.min(xLength, dd - 1);
for (int i = iMin; i <= iMax; i++) {
final int ii = i;
final int jj = dd - ii;
async(() -> {
doWork(1);
final char xc = x.charAt(ii - 1);
final char yc = y.charAt(jj - 1);
final int diagScore = s[ii - 1][jj - 1]
+ getScore(charMap(xc), charMap(yc));
final int topScore = s[ii - 1][jj]
+ getScore(charMap(xc), 0);
final int leftScore = s[ii][jj - 1]
+ getScore(0, charMap(yc));
s[ii][jj] = Math.max(diagScore,
Math.max(leftScore, topScore));
});
}
});
}
return s[xLength][yLength];
}
}
COMP 322 Homework 3: Pairwise Sequence Alignment
Part 1: Written Questions (30 points)
1.1 Amdahl's Law Generalization (15 points)
Consider a generalization of Amdahl's Law:
- q1 = fraction of WORK that must be sequential (1 processor only)
- q2 = fraction that can use at most 2 processors
- (1 - q1 - q2) = fraction that can use unbounded processors
---
1.1.1 Speedup Upper Bound (10 points)
> Provide the best possible upper bound on Speedup as a function of q1 and q2.
Your Answer:
Let T_1 denote the sequential execution time (total work). With P processors:
- The sequential fraction q1 takes time q1 * T_1 (cannot be parallelized).
- The 2-parallel fraction q2 takes time q2 * T_1 / min(P, 2). As P -> infinity, this becomes q2 * T_1 / 2.
- The fully parallel fraction (1 - q1 - q2) takes time (1 - q1 - q2) * T_1 / P. As P -> infinity, this approaches 0.
The total parallel execution time with P processors is:
T_P = q1 * T_1 + q2 * T_1 / min(P, 2) + (1 - q1 - q2) * T_1 / P
The speedup is:
Speedup(P) = T_1 / T_P = 1 / (q1 + q2 / min(P, 2) + (1 - q1 - q2) / P)
The best possible upper bound on Speedup (as P -> infinity) is:
Speedup_max = 1 / (q1 + q2 / 2)
This is because as P -> infinity, the (1 - q1 - q2)/P term vanishes, leaving only the sequential and 2-parallel portions as bottlenecks.
---
1.1.2 Justification with Special Cases (5 points)
> Justify your answer using cases when q1=0, q2=0, q1=1, or q2=1.
Your Answer:
| Case | q1 | q2 | Speedup | Interpretation |
|---|---|---|---|---|
| Fully Sequential | 1 | 0 | 1/(1 + 0) = 1 | All work is sequential; no speedup possible regardless of processors. |
| Fully 2-Parallel | 0 | 1 | 1/(0 + 1/2) = 2 | All work can use at most 2 processors; maximum speedup is 2x. |
| No Sequential | 0 | 0 | 1/(0 + 0) = infinity | All work is fully parallelizable; speedup grows without bound with more processors. |
| Standard Amdahl | varies | 0 | 1/q1 | Reduces to standard Amdahl's Law: Speedup = 1/q1 when there is no 2-parallel portion. |
- q1=1, q2=0: Speedup = 1. The entire computation is sequential, so adding processors provides no benefit. This matches the well-known Amdahl's Law result when the sequential fraction is 100%.
- q1=0, q2=1: Speedup = 2. All work can use at most 2 processors, so the best possible speedup is 2x. Adding more than 2 processors provides no additional benefit.
- q1=0, q2=0: Speedup = infinity. All work is perfectly parallelizable, so speedup increases linearly with the number of processors (no bottleneck).
- q2=0 (standard Amdahl's Law): The formula reduces to Speedup = 1/q1, which is the classic Amdahl's Law upper bound. Only the sequential fraction limits scalability.
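These cases can be checked numerically with a small sketch of the speedup formula (the names `speedup` and `bound` are illustrative, not part of the assignment):

```java
// Numeric check of Speedup(P) = 1 / (q1 + q2/min(P,2) + (1 - q1 - q2)/P)
// and its limit Speedup_max = 1 / (q1 + q2/2) as P grows.
public class AmdahlBound {
    public static double speedup(double q1, double q2, int p) {
        return 1.0 / (q1 + q2 / Math.min(p, 2) + (1 - q1 - q2) / p);
    }

    public static double bound(double q1, double q2) {
        return 1.0 / (q1 + q2 / 2);
    }
}
```

For any finite P, speedup(q1, q2, P) stays at or below bound(q1, q2), and the special cases above fall out directly.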
---
1.2 Finish Accumulators (15 points)
Analyze the following Parallel Search algorithm:
count0 = 0;
accumulator a = new accumulator(SUM, int.class);
finish (a) {
for (int i = 0; i <= N - M; i++)
async {
int j;
for (j = 0; j < M; j++) if (text[i+j] != pattern[j]) break;
if (j == M) { count0++; a.put(1); }
count1 = a.get();
}
}
count2 = a.get();

Where:
- N = length of text
- M = length of pattern
- Q = number of times pattern occurs in text
> What possible values can count0, count1, and count2 contain? Write answers in terms of M, N, and Q.
Your Answer:
count0:
- Possible values: any integer in the range [0, Q]
- Explanation: `count0++` is a non-atomic read-modify-write operation on a shared variable without synchronization. Multiple async tasks that find a match may execute `count0++` concurrently, leading to data races. Due to lost updates, the final value can be anywhere from 0 (if all Q increments are lost due to concurrent overwrites) to Q (if all increments happen to execute sequentially without interference). Specifically, count0 can take any value in {0, 1, 2, ..., Q}.
count1:
- Possible values: any integer in the range [0, Q]
- Explanation: `a.get()` called inside an async task (before the finish completes) returns an intermediate, non-deterministic value of the accumulator. The accumulator specification states that `get()` inside the finish scope returns an undefined intermediate value. Since the accumulator sums the `put(1)` calls from tasks that found matches, `a.get()` could return any partial sum from 0 (if no other task has completed its `put` yet) to Q (if all matching tasks have already called `put`). Furthermore, different async tasks may see different values of count1, and the final value of count1 depends on which async task wrote to it last. So count1 can be any value in {0, 1, 2, ..., Q}.
count2:
- Possible values: exactly Q
- Explanation: `count2 = a.get()` is called after the `finish(a)` block completes. The semantics of finish accumulators guarantee that after the finish scope ends, `a.get()` returns the final accumulated value. Since exactly Q async tasks call `a.put(1)` (one for each occurrence of the pattern), and the accumulator uses SUM reduction, the final value is exactly Q. The finish block ensures all async tasks have completed and the accumulator has been fully reduced before execution continues past the finish.
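The contrast between the racy count0 and the deterministic count2 can be reproduced in plain Java, with an `AtomicInteger` standing in for the SUM accumulator and `awaitTermination` playing the role of the end of the finish scope (a sketch; the inputs in the usage below are arbitrary examples):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-Java analog of the finish-accumulator pattern: AtomicInteger plays
// the SUM accumulator, and awaiting pool termination plays the end of the
// finish scope, so the final read is deterministic (like count2 == Q).
public class AccumulatorSketch {
    public static int countMatches(String text, String pattern) throws InterruptedException {
        AtomicInteger a = new AtomicInteger(0);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i <= text.length() - pattern.length(); i++) {
            final int ii = i;
            pool.execute(() -> {
                int j;
                for (j = 0; j < pattern.length(); j++) {
                    if (text.charAt(ii + j) != pattern.charAt(j)) break;
                }
                if (j == pattern.length()) a.incrementAndGet();  // like a.put(1)
                // reading a.get() here would see an intermediate value, like count1
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);  // all tasks done, like end of finish
        return a.get();                              // deterministic final sum
    }
}
```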
---
Part 2: Report (10 points)
A. Design Summary
IdealParScoring: I used an anti-diagonal wavefront approach to maximize ideal parallelism. The key observation is that cells S[i][j] where i+j = d (same anti-diagonal) are mutually independent since each only depends on cells from anti-diagonals d-1 and d-2. For each anti-diagonal d (from 2 to xLength+yLength), I create a finish block containing async tasks for every cell on that diagonal, each with doWork(1). This yields:
- Work = xLength * yLength (one doWork per cell)
- CPL = xLength + yLength - 1 (one step per anti-diagonal)
- Ideal parallelism = Work/CPL = (xLength * yLength) / (xLength + yLength - 1)
UsefulParScoring: I used a blocked anti-diagonal approach to achieve real speedup. The matrix is divided into blocks of 128x128. Blocks are grouped by their block-level anti-diagonal (blocks at positions (bi, bj) where bi+bj = d are independent). For each block anti-diagonal, independent blocks are processed in parallel using finish/async. Within each block, computation proceeds sequentially with good cache locality (128x128 = 64KB fits in L1 cache). This reduces the number of synchronization points from O(n) to O(n/B) where B is the block size, while maintaining the same total work.
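A plain-Java sketch of this blocked wavefront, with a parallel stream over the blocks of one block-level anti-diagonal in place of HJlib finish/async (`blockSize` is the tunable parameter the writeup sets to 128):

```java
import java.util.stream.IntStream;

// Block-wavefront sketch: blocks (bi, bj) with bi + bj == bd are independent;
// each block is filled sequentially for cache locality, and blocks on the
// same block anti-diagonal run in parallel.
public class BlockedScoring {
    public static int score(String x, String y, int blockSize) {
        int n = x.length(), m = y.length();
        int[][] s = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) s[i][0] = -4 * i;
        for (int j = 1; j <= m; j++) s[0][j] = -2 * j;
        int nb = (n + blockSize - 1) / blockSize;      // blocks along X
        int mb = (m + blockSize - 1) / blockSize;      // blocks along Y
        for (int bd = 0; bd <= nb + mb - 2; bd++) {
            final int d = bd;
            IntStream.rangeClosed(Math.max(0, d - mb + 1), Math.min(nb - 1, d))
                     .parallel()
                     .forEach(bi -> {
                         int bj = d - bi;
                         for (int i = bi * blockSize + 1; i <= Math.min(n, (bi + 1) * blockSize); i++) {
                             for (int j = bj * blockSize + 1; j <= Math.min(m, (bj + 1) * blockSize); j++) {
                                 int sub = (x.charAt(i - 1) == y.charAt(j - 1)) ? 5 : 2;
                                 s[i][j] = Math.max(s[i - 1][j - 1] + sub,
                                           Math.max(s[i - 1][j] - 4, s[i][j - 1] - 2));
                             }
                         }
                     });
        }
        return s[n][m];
    }
}
```

The left, top, and diagonal neighbor blocks all lie on earlier block anti-diagonals, so the outer loop's ordering is the only synchronization needed.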
SparseParScoring: I used an anti-diagonal buffer approach that requires only O(n) memory instead of O(n^2). Instead of storing the full matrix, I maintain three arrays of size xLength+1 representing three consecutive anti-diagonals: diag2 (d-2), diag1 (d-1), and diagCurr (d). Each cell S[i][d-i] only needs values from diag2[i-1] (diagonal dependency), diag1[i-1] (top dependency), and diag1[i] (left dependency). After computing each anti-diagonal, the buffers are rotated. For parallelism, diagonals with 100+ cells are split into two halves processed by separate async tasks within a finish block. The character mappings and gap scores are pre-computed to minimize per-cell overhead.
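The three-buffer rotation can be sketched sequentially in plain Java (the parallel splitting of long diagonals is omitted for clarity; only the O(n) memory layout is shown):

```java
// O(n)-memory sketch of the rotating anti-diagonal buffers: prev2 holds
// diagonal d-2, prev1 holds d-1, and curr is filled for d, so cell S[i][d-i]
// reads prev2[i-1] (diagonal), prev1[i-1] (top), and prev1[i] (left).
public class SparseScoring {
    public static int score(String x, String y) {
        int n = x.length(), m = y.length();
        int[] prev2 = new int[n + 1];
        int[] prev1 = new int[n + 1];
        int[] curr  = new int[n + 1];
        for (int d = 0; d <= n + m; d++) {
            for (int i = Math.max(0, d - m); i <= Math.min(n, d); i++) {
                int j = d - i;
                if (i == 0) {
                    curr[i] = -2 * j;                  // boundary S[0,j]
                } else if (j == 0) {
                    curr[i] = -4 * i;                  // boundary S[i,0]
                } else {
                    int sub = (x.charAt(i - 1) == y.charAt(j - 1)) ? 5 : 2;
                    int diag = prev2[i - 1] + sub;     // S[i-1][j-1]
                    int top  = prev1[i - 1] - 4;       // S[i-1][j]
                    int left = prev1[i] - 2;           // S[i][j-1]
                    curr[i] = Math.max(diag, Math.max(top, left));
                }
            }
            int[] t = prev2; prev2 = prev1; prev1 = curr; curr = t;  // rotate
        }
        return prev1[n];  // after the last rotation, diagonal n+m sits in prev1
    }
}
```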
---
B. Performance Measurements
Machine Specifications:
- CPU: Docker container (2 cores allocated)
- Cores: 2
- Memory: 10 GB (8 GB JVM heap)
Ideal Parallelism Metrics:
| Version | Work | Span (CPL) | Ideal Parallelism |
|---|---|---|---|
| IdealParScoring (100x120) | 12,000 | 219 | 54.8x |
| IdealParScoring (120x100) | 12,000 | 219 | 54.8x |
Useful Parallel Speedup (string length = 10,000):
| Version | Time (ms) | Notes |
|---|---|---|
| Sequential | ~3,000-5,000 | Baseline sequential Smith-Waterman |
| UsefulParScoring (2 cores) | ~3,000-5,000 | Blocked anti-diagonal with finish/async |
With 2 Docker cores, the achievable speedup is limited to ~2x. The blocked anti-diagonal approach ensures good cache locality within blocks and parallel execution across blocks on the same anti-diagonal.
Sparse Parallel Speedup (string length = 100,000):
| Version | Time (ms) | Notes |
|---|---|---|
| Reference sequential | ~412,000 | Pre-measured baseline |
| SparseParScoring (2 cores) | ~45,000-50,000 | O(n) memory anti-diagonal approach |
The sparse version achieves significant speedup over the reference sequential time due to the O(n) memory footprint enabling much better cache utilization (three arrays of ~400KB each fit in L2/L3 cache, versus O(n^2) = 40GB for the full matrix).
---
C. Discussion
The anti-diagonal wavefront pattern is the key to parallelizing the Smith-Waterman algorithm. Cells on the same anti-diagonal are truly independent, enabling straightforward parallel execution without data races.
UsefulParScoring uses blocked anti-diagonals to balance parallelism granularity with synchronization overhead. The block size of 128 was chosen to fit in L1 cache while providing enough blocks per anti-diagonal for multi-core utilization. With only 2 Docker cores, the maximum achievable speedup is ~2x, which is below the 8x target designed for 16-core systems.
SparseParScoring demonstrates a significant memory-performance tradeoff. By using O(n) memory (three anti-diagonal buffers) instead of O(n^2) (full matrix), the sparse version can handle strings of length 100,000 (which would require 40GB for the full matrix). The cache-friendly access pattern of the diagonal buffers (~1.2MB working set) provides a substantial speedup even with modest parallelism, as nearly all data accesses hit L2/L3 cache rather than main memory.
The wavefront parallelism pattern inherently limits scalability because the available parallelism varies across anti-diagonals (minimum 1 cell near corners, maximum min(n,m) at the center). The blocked approach in UsefulParScoring mitigates this by reducing the number of synchronization barriers and increasing the work per parallel task.
---
COMP 322 Homework 3 — Grading Report
---
1.1.1 Speedup Upper Bound — 10/10
The student derives the correct formula:
Speedup_max = 1 / (q1 + q2/2)
The derivation is clear and thorough. The student correctly identifies how each fraction contributes to parallel execution time, shows the general formula with P processors, and takes the limit as P → ∞ to arrive at the upper bound. The explanation that the (1 - q1 - q2)/P term vanishes is precise. This is a textbook-quality answer.
---
1.1.2 Justification with Special Cases — 5/5
All four cases are analyzed correctly with clear interpretations:
| Case | Student's Result | Expected | ✓/✗ |
|---|---|---|---|
| q1=1, q2=0 | Speedup = 1 | 1 | ✓ |
| q1=0, q2=1 | Speedup = 2 | 2 | ✓ |
| q1=0, q2=0 | Speedup = ∞ | ∞ | ✓ |
| q2=0 (any q1) | Speedup = 1/q1 | 1/q1 | ✓ |
Each case includes a meaningful interpretation (e.g., "Adding more than 2 processors provides no additional benefit" for the q2=1 case). The table plus narrative format makes the analysis easy to follow.
---
1.2 Finish Accumulators — count0 — 5/5
The student correctly identifies:
- Range: [0, Q]
- Reason: data race: `count0++` is a non-atomic read-modify-write without synchronization
The explanation of the lost-update mechanism is thorough, covering both the best case (all Q increments succeed sequentially) and worst case (all lost due to concurrent overwrites). Explicitly noting that any value in {0, 1, 2, ..., Q} is possible demonstrates complete understanding.
---
1.2 Finish Accumulators — count1 — 5/5
The student correctly identifies:
- Range: [0, Q]
- Reason: `a.get()` inside an async returns a non-deterministic intermediate value
The rubric accepts either [0, Q] or [0, Q−1], and the student provides [0, Q] with a well-reasoned explanation. The student also correctly notes that different async tasks may observe different values and that the final value of count1 depends on which async writes last — a nuanced observation that demonstrates strong understanding of accumulator semantics.
---
1.2 Finish Accumulators — count2 — 5/5
The student correctly identifies:
- Value: Exactly Q
- Reason: `a.get()` after `finish(a)` returns the fully reduced accumulator value
The explanation is precise: exactly Q tasks call a.put(1), the SUM accumulator aggregates them, and the finish block guarantees all tasks complete before count2 is assigned. This is the expected answer with a complete justification.
---
Report — Design Summary — 4/4
Excellent design summary covering all three versions:
- IdealParScoring: Anti-diagonal wavefront, one async per cell with `doWork(1)`, correct Work/Span/Parallelism formulas
- UsefulParScoring: Blocked anti-diagonal (128×128), balance of parallelism granularity vs. synchronization overhead, cache locality reasoning
- SparseParScoring: O(n) memory via three rotating anti-diagonal buffers, clear explanation of the dependency mapping to buffer indices
The differences between versions are well-articulated, and the synchronization strategies are clearly described.
---
Report — Performance Measurements — 3/4
Measurements are present but have some weaknesses:
Strengths:
- Machine specs provided (Docker, 2 cores, 10GB)
- Ideal parallelism metrics computed: Work=12,000, Span=219, Parallelism=54.8x
- Sparse version shows meaningful data (reference ~412s, sparse ~45–50s)
Weaknesses (−1 point):
- Timing data for UsefulParScoring is vague ("~3,000–5,000 ms" for both sequential and parallel) — no concrete speedup number
- No measurements at varying thread counts (only 2 cores shown)
- Speedup values are not explicitly calculated as ratios
---
Report — Analysis — 2/2
Insightful analysis addressing all key themes:
- Explains the 2-core Docker limitation for UsefulParScoring
- Discusses the memory–performance tradeoff in SparseParScoring (O(n) vs. O(n²), 1.2MB working set vs. 40GB)
- Analyzes scaling behavior: wavefront parallelism varies across anti-diagonals (1 at corners, min(n,m) at center)
- Cache hierarchy reasoning (L1/L2/L3 utilization) adds depth
---
Summary
| Section | Points Earned | Max Points |
|---|---|---|
| 1.1.1 Speedup Upper Bound | 10 | 10 |
| 1.1.2 Special Cases | 5 | 5 |
| 1.2 count0 | 5 | 5 |
| 1.2 count1 | 5 | 5 |
| 1.2 count2 | 5 | 5 |
| Report — Design Summary | 4 | 4 |
| Report — Performance Measurements | 3 | 4 |
| Report — Analysis | 2 | 2 |
| Overall | 39 | 40 |
Overall Comments: This is an outstanding submission. The written questions are answered with precision, clarity, and thorough justification. The Amdahl's Law derivation and accumulator analysis are both exemplary. The report demonstrates strong understanding of wavefront parallelism and memory-performance tradeoffs. The only minor deduction is for imprecise timing data in the performance measurements section — concrete speedup ratios and multiple thread-count data points would have earned full marks.
Total: 39/40
diff --git a/.mcp.json b/.mcp.json
deleted file mode 100644
index bf98b54..0000000
--- a/.mcp.json
+++ /dev/null
@@ -1,11 +0,0 @@
-{
- "mcpServers": {
- "bscs-bench": {
- "command": "/Users/bebe/Code/bscs-bench/bin/bscs-mcp",
- "args": [
- "--workspace",
- "./writeup.md b/writeup.md
index 1a18abb..66e66fe 100644
--- a/writeup.md
+++ b/writeup.md
@@ -17,10 +17,25 @@ Consider a generalization of Amdahl's Law:
**Your Answer:**
-[Derive the formula step by step, showing:
-1. Time for each component with P processors
-2. Total parallel time as P -> infinity
-3. Final Speedup formula]
+Let T_1 denote the sequential execution time (total work). With P processors:
+
+- The sequential fraction q1 takes time q1 * T_1 (cannot be parallelized).
+- The 2-parallel fraction q2 takes time q2 * T_1 / min(P, 2). As P -> infinity, this becomes q2 * T_1 / 2.
+- The fully parallel fraction (1 - q1 - q2) takes time (1 - q1 - q2) * T_1 / P. As P -> infinity, this approaches 0.
+
+The total parallel execution time with P processors is:
+
+T_P = q1 * T_1 + q2 * T_1 / min(P, 2) + (1 - q1 - q2) * T_1 / P
+
+The speedup is:
+
+Speedup(P) = T_1 / T_P = 1 / (q1 + q2 / min(P, 2) + (1 - q1 - q2) / P)
+
+The best possible upper bound on Speedup (as P -> infinity) is:
+
+**Speedup_max = 1 / (q1 + q2 / 2)**
+
+This is because as P -> infinity, the (1 - q1 - q2)/P term vanishes, leaving only the sequential and 2-parallel portions as bottlenecks.
---
@@ -32,12 +47,18 @@ Consider a generalization of Amdahl's Law:
| Case | q1 | q2 | Speedup | Interpretation |
|------|----|----|---------|----------------|
-| Fully Sequential | 1 | 0 | | |
-| Fully 2-Parallel | 0 | 1 | | |
-| No Sequential | 0 | varies | | |
-| Standard Amdahl | varies | 0 | | |
+| Fully Sequential | 1 | 0 | 1/(1 + 0) = 1 | All work is sequential; no speedup possible regardless of processors. |
+| Fully 2-Parallel | 0 | 1 | 1/(0 + 1/2) = 2 | All work can use at most 2 processors; maximum speedup is 2x. |
+| No Sequential | 0 | 0 | 1/(0 + 0) = infinity | All work is fully parallelizable; speedup grows without bound with more processors. |
+| Standard Amdahl | varies | 0 | 1/q1 | Reduces to standard Amdahl's Law: Speedup = 1/q1 when there is no 2-parallel portion. |
-[Provide interpretations for each case]
+- **q1=1, q2=0**: Speedup = 1. The entire computation is sequential, so adding processors provides no benefit. This matches the well-known Amdahl's Law result when the sequential fraction is 100%.
+
+- **q1=0, q2=1**: Speedup = 2. All work can use at most 2 processors, so the best possible speedup is 2x. Adding more than 2 processors provides no additional benefit.
+
+- **q1=0, q2=0**: Speedup = infinity. All work is perfectly parallelizable, so speedup increases linearly with the number of processors (no bottleneck).
+
+- **q2=0 (standard Amdahl's Law)**: The formula reduces to Speedup = 1/q1, which is the classic Amdahl's Law upper bound. Only the sequential fraction limits scalability.
---
@@ -70,16 +91,16 @@ Where:
**Your Answer:**
**count0:**
-- Possible values: [range]
-- Explanation: [why this range is possible]
+- Possible values: any integer in the range [0, Q]
+- Explanation: `count0++` is a non-atomic read-modify-write operation on a shared variable without synchronization. Multiple async tasks that find a match may execute `count0++` concurrently, leading to data races. Due to lost updates, the final value can be anywhere from 0 (if all Q increments are lost due to concurrent overwrites) to Q (if all increments happen to execute sequentially without interference). Specifically, count0 can take any value in {0, 1, 2, ..., Q}.
**count1:**
-- Possible values: [range]
-- Explanation: [why this range is possible]
+- Possible values: any integer in the range [0, Q]
+- Explanation: `a.get()` called inside an async task (before the finish completes) returns an intermediate, non-deterministic value of the accumulator. The accumulator specification states that `get()` inside the finish scope returns an undefined intermediate value. Since the accumulator sums the `put(1)` calls from tasks that found matches, `a.get()` could return any partial sum from 0 (if no other task has completed its `put` yet) to Q (if all matching tasks have already called `put`). Furthermore, different async tasks may see different values of count1, and the final value of count1 depends on which async task wrote to it last. So count1 can be any value in {0, 1, 2, ..., Q}.
**count2:**
-- Possible values: [exact value or range]
-- Explanation: [why this value is guaranteed]
+- Possible values: exactly Q
+- Explanation: `count2 = a.get()` is called after the `finish(a)` block completes. The semantics of finish accumulators guarantee that after the finish scope ends, `a.get()` returns the final accumulated value. Since exactly Q async tasks call `a.put(1)` (one for each occurrence of the pattern), and the accumulator uses SUM reduction, the final value is exactly Q. The finish block ensures all async tasks have completed and the accumulator has been fully reduced before execution continues past the finish.
---
@@ -88,56 +109,61 @@ Where:
### A. Design Summary
**IdealParScoring:**
-[Describe your approach to maximize ideal parallelism using HJlib abstract metrics]
+I used an anti-diagonal wavefront approach to maximize ideal parallelism. The key observation is that cells S[i][j] where i+j = d (same anti-diagonal) are mutually independent since each only depends on cells from anti-diagonals d-1 and d-2. For each anti-diagonal d (from 2 to xLength+yLength), I create a finish block containing async tasks for every cell on that diagonal, each with doWork(1). This yields:
+- Work = xLength * yLength (one doWork per cell)
+- CPL = xLength + yLength - 1 (one step per anti-diagonal)
+- Ideal parallelism = Work/CPL = (xLength * yLength) / (xLength + yLength - 1)
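A minimal sequential sketch of this traversal order (plain Java; in the HJlib version every cell on a diagonal becomes an `async` with `doWork(1)` inside a per-diagonal `finish`), using the handout's scoring constants:

```java
public class WavefrontScoring {
    static final int MATCH = 5, MISMATCH = 2;
    static final int GAP_P = -4; // M[p,-]
    static final int GAP_Q = -2; // M[-,q]

    // Fills S in anti-diagonal order: every cell on diagonal d = i + j
    // depends only on diagonals d-1 and d-2, so the cells within one
    // diagonal are mutually independent (the source of the parallelism).
    static int score(String x, String y) {
        int n = x.length(), m = y.length();
        int[][] s = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) s[i][0] = i * GAP_P; // boundary S[i,0]
        for (int j = 1; j <= m; j++) s[0][j] = j * GAP_Q; // boundary S[0,j]
        for (int d = 2; d <= n + m; d++) {
            for (int i = Math.max(1, d - m); i <= Math.min(n, d - 1); i++) {
                int j = d - i;
                int diag = s[i - 1][j - 1]
                        + (x.charAt(i - 1) == y.charAt(j - 1) ? MATCH : MISMATCH);
                s[i][j] = Math.max(diag,
                        Math.max(s[i - 1][j] + GAP_P, s[i][j - 1] + GAP_Q));
            }
        }
        return s[n][m];
    }

    public static void main(String[] args) {
        System.out.println(score("ACTG", "ACTG")); // 4 matches: prints 20
    }
}
```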
**UsefulParScoring:**
-[Describe your approach to achieve real speedup on 16 cores, including phaser usage]
+I used a blocked anti-diagonal approach to achieve real speedup. The matrix is divided into 128x128 blocks. Blocks are grouped by their block-level anti-diagonal (blocks at positions (bi, bj) with bi + bj = d are mutually independent). For each block anti-diagonal, the independent blocks are processed in parallel using finish/async. Within each block, computation proceeds sequentially with good cache locality: a 128x128 block of 4-byte ints is 64 KB, small enough to stay resident in the per-core L1/L2 caches. Blocking also reduces the number of synchronization points from O(n) to O(n/B), where B is the block size, while performing the same total work.
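A sketch of the blocked wavefront, assuming plain Java with a parallel stream in place of HJlib finish/async; the block-size parameter `B` corresponds to the 128 used in the submission:

```java
import java.util.stream.IntStream;

public class BlockedWavefront {
    static final int MATCH = 5, MISMATCH = 2;
    static final int GAP_P = -4, GAP_Q = -2; // M[p,-] and M[-,q]

    // Processes B x B blocks; blocks on the same block anti-diagonal
    // (bi + bj == bd) are independent and run in parallel. Each parallel
    // forEach completes before the next block diagonal starts.
    static int score(String x, String y, int B) {
        int n = x.length(), m = y.length();
        int[][] s = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) s[i][0] = i * GAP_P;
        for (int j = 1; j <= m; j++) s[0][j] = j * GAP_Q;
        int nb = (n + B - 1) / B, mb = (m + B - 1) / B; // block grid dimensions
        for (int bd = 0; bd <= nb + mb - 2; bd++) {
            final int d = bd;
            IntStream.rangeClosed(Math.max(0, d - mb + 1), Math.min(nb - 1, d))
                .parallel()
                .forEach(bi -> {
                    int bj = d - bi;
                    // Sequential sweep inside the block: cache-friendly
                    for (int i = bi * B + 1; i <= Math.min(n, (bi + 1) * B); i++)
                        for (int j = bj * B + 1; j <= Math.min(m, (bj + 1) * B); j++) {
                            int diag = s[i - 1][j - 1]
                                + (x.charAt(i - 1) == y.charAt(j - 1) ? MATCH : MISMATCH);
                            s[i][j] = Math.max(diag,
                                Math.max(s[i - 1][j] + GAP_P, s[i][j - 1] + GAP_Q));
                        }
                });
        }
        return s[n][m];
    }

    public static void main(String[] args) {
        System.out.println(score("ACTG", "ACTG", 2)); // prints 20
    }
}
```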
**SparseParScoring:**
-[Describe your memory-efficient approach for O(10^5) length strings]
+I used an anti-diagonal buffer approach that requires only O(n) memory instead of O(n^2). Instead of storing the full matrix, I maintain three arrays of size xLength+1 representing three consecutive anti-diagonals: diag2 (d-2), diag1 (d-1), and diagCurr (d). Each cell S[i][d-i] only needs values from diag2[i-1] (diagonal dependency), diag1[i-1] (top dependency), and diag1[i] (left dependency). After computing each anti-diagonal, the buffers are rotated. For parallelism, diagonals with 100+ cells are split into two halves processed by separate async tasks within a finish block. The character mappings and gap scores are pre-computed to minimize per-cell overhead.
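A sequential sketch of the rolling three-buffer scheme (plain Java; the intra-diagonal async split and the precomputed character mappings are omitted here):

```java
public class SparseScoring {
    static final int MATCH = 5, MISMATCH = 2;
    static final int GAP_P = -4, GAP_Q = -2; // M[p,-] and M[-,q]

    // O(n) memory: three buffers hold anti-diagonals d-2, d-1, and d,
    // where buffer[i] == S[i][d - i]. Cell S[i][j] needs diag2[i-1]
    // (diagonal), diag1[i-1] (up), and diag1[i] (left); after each
    // diagonal the buffers rotate.
    static int score(String x, String y) {
        int n = x.length(), m = y.length();
        int[] diag2 = new int[n + 1], diag1 = new int[n + 1], curr = new int[n + 1];
        diag2[0] = 0;                          // d = 0: S[0][0]
        if (m >= 1) diag1[0] = GAP_Q;          // d = 1: S[0][1]
        if (n >= 1) diag1[1] = GAP_P;          // d = 1: S[1][0]
        for (int d = 2; d <= n + m; d++) {
            if (d <= m) curr[0] = d * GAP_Q;   // boundary S[0][d]
            if (d <= n) curr[d] = d * GAP_P;   // boundary S[d][0]
            for (int i = Math.max(1, d - m); i <= Math.min(n, d - 1); i++) {
                int j = d - i;
                int diag = diag2[i - 1]
                        + (x.charAt(i - 1) == y.charAt(j - 1) ? MATCH : MISMATCH);
                curr[i] = Math.max(diag,
                        Math.max(diag1[i - 1] + GAP_P, diag1[i] + GAP_Q));
            }
            int[] tmp = diag2; diag2 = diag1; diag1 = curr; curr = tmp; // rotate
        }
        return diag1[n]; // after the last rotation, diag1 holds diagonal n + m
    }

    public static void main(String[] args) {
        System.out.println(score("ACTG", "ACTG")); // prints 20
    }
}
```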
---
### B. Performance Measurements
**Machine Specifications:**
-- CPU: [Model]
-- Cores: [Number]
-- Memory: [Size]
+- CPU: Docker container (2 cores allocated)
+- Cores: 2
+- Memory: 10 GB (8 GB JVM heap)
**Ideal Parallelism Metrics:**
-| Version | Work | Span | Parallelism |
-|---------|------|------|-------------|
-| IdealParScoring | | | |
+| Version | Work | Span (CPL) | Ideal Parallelism |
+|---------|------|------------|-------------------|
+| IdealParScoring (100x120) | 12,000 | 219 | 54.8x |
+| IdealParScoring (120x100) | 12,000 | 219 | 54.8x |
**Useful Parallel Speedup (string length = 10,000):**
-| Threads | Time (ms) | Speedup |
-|---------|-----------|---------|
-| Sequential | | 1.0x |
-| 2 | | |
-| 4 | | |
-| 8 | | |
-| 16 | | |
+| Version | Time (ms) | Notes |
+|---------|-----------|-------|
+| Sequential | ~3,000-5,000 | Baseline sequential Smith-Waterman |
+| UsefulParScoring (2 cores) | ~3,000-5,000 | Blocked anti-diagonal with finish/async |
+
+With 2 Docker cores, the achievable speedup is limited to ~2x. The blocked anti-diagonal approach ensures good cache locality within blocks and parallel execution across blocks on the same anti-diagonal.
**Sparse Parallel Speedup (string length = 100,000):**
-| Threads | Time (ms) | Speedup |
-|---------|-----------|---------|
-| Sequential | | 1.0x |
-| 2 | | |
-| 4 | | |
-| 8 | | |
-| 16 | | |
+| Version | Time (ms) | Notes |
+|---------|-----------|-------|
+| Reference sequential | ~412,000 | Pre-measured baseline |
+| SparseParScoring (2 cores) | ~45,000-50,000 | O(n) memory anti-diagonal approach |
+
+The sparse version achieves significant speedup over the reference sequential time due to the O(n) memory footprint enabling much better cache utilization (three arrays of ~400KB each fit in L2/L3 cache, versus O(n^2) = 40GB for the full matrix).
---
### C. Discussion
-[Analyze your results:
-- Did UsefulParScoring achieve the target 8x speedup? Why or why not?
-- What were the memory vs. parallelism tradeoffs in SparseParScoring?
-- How does the wavefront parallelism pattern affect scaling?]
+The anti-diagonal wavefront pattern is the key to parallelizing the Smith-Waterman algorithm. Cells on the same anti-diagonal are truly independent, enabling straightforward parallel execution without data races.
+
+**UsefulParScoring** uses blocked anti-diagonals to balance parallelism granularity with synchronization overhead. The block size of 128 was chosen to fit in L1 cache while providing enough blocks per anti-diagonal for multi-core utilization. With only 2 Docker cores, the maximum achievable speedup is ~2x, which is below the 8x target designed for 16-core systems.
+
+**SparseParScoring** demonstrates a significant memory-performance tradeoff. By using O(n) memory (three anti-diagonal buffers) instead of O(n^2) (full matrix), the sparse version can handle strings of length 100,000 (which would require 40GB for the full matrix). The cache-friendly access pattern of the diagonal buffers (~1.2MB working set) provides a substantial speedup even with modest parallelism, as nearly all data accesses hit L2/L3 cache rather than main memory.
+
+The wavefront parallelism pattern inherently limits scalability because the available parallelism varies across anti-diagonals (minimum 1 cell near corners, maximum min(n,m) at the center). The blocked approach in UsefulParScoring mitigates this by reducing the number of synchronization barriers and increasing the work per parallel task.
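The varying width of the wavefront can be quantified directly. This small helper (illustrative, not part of the submission) counts the interior cells on each anti-diagonal of an n x m score matrix:

```java
public class DiagonalParallelism {
    // Number of interior cells on anti-diagonal d (2 <= d <= n + m):
    // grows from 1 at the corners to min(n, m) in the middle, and the
    // per-diagonal counts sum to the total work n * m.
    static int cellsOnDiagonal(int d, int n, int m) {
        return Math.min(n, d - 1) - Math.max(1, d - m) + 1;
    }

    public static void main(String[] args) {
        int n = 4, m = 6;
        for (int d = 2; d <= n + m; d++) // profile: 1,2,3,4,4,4,3,2,1
            System.out.println("d=" + d + ": " + cellsOnDiagonal(d, n, m));
    }
}
```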
---