Agent Work: Exploring Baby Names
Claude Sonnet 4.6 · COMP 341: Practical Machine Learning
Homework 1: Exploring Baby Names
Overview
Analyze baby name data from the US Social Security Administration (SSA) to explore naming trends and see whether a person's age can be predicted from their first name alone.
Data: SSA provides yearly files (1880-2024) with baby names, sex, and counts.
- Format: name,sex,count (no header)
- Files: yob1880.txt through yob2024.txt
- Location: data/hw1/ (symlinked to workspace)
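As a minimal sketch of the file format described above, the following parses one headerless file and attaches the year taken from its filename. The helper name `read_year_file` and the two sample rows are made up for illustration; they are not part of the assignment or the SSA data.

```python
import io
from pathlib import Path

import pandas as pd


def read_year_file(filename: str, handle) -> pd.DataFrame:
    """Parse one headerless name,sex,count file and add a year column."""
    # "yob1880.txt" -> stem "yob1880" -> 1880
    year = int(Path(filename).stem[3:])
    df = pd.read_csv(handle, header=None, names=["name", "sex", "count"])
    df["year"] = year
    return df


# Illustrative rows only, not real SSA counts.
sample = io.StringIO("Alice,F,10\nBob,M,7\n")
demo = read_year_file("yob1880.txt", sample)
print(demo)
```

Passing an open handle instead of a path keeps the sketch runnable without the data directory; with real files the same call works on a path string.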
Tasks
Part 0: Read in Data (5 pts)
Load all yearly files into a single pandas DataFrame with columns:
- name: Baby name (string)
- sex: 'M' or 'F' (string)
- count: Number of babies with this name that year (int)
- year: Year of birth (int)
Expected: ~2 million rows total
Part 1: Sanity Checks (3 pts)
Verify your DataFrame:
- Check dimensions (should have ~2,052,781 rows)
- Look at top and bottom rows
Part 2: Search for General Patterns (10 pts)
Plot total births per year with different colors for M/F.
- Clear axis labels
- Appropriately sized figure
- Readable tick labels
Written Question: Do you notice any interesting patterns across years? Do they relate to historical events?
Part 3: Disentangling the Sexes (6 pts)
Investigate boys named Ruth:
- Plot boys named Ruth over time
- Find which year had the most boys named Ruth
Written Question: Do you think Babe Ruth (baseball player) had any influence on boys named Ruth?
Part 4: Oldest and Youngest Names (50 pts)
1. Filter to (name, sex) pairs with ≥20,000 total occurrences
2. Calculate how many names were filtered out
3. Calculate median and mean birth year for each name
4. Find top 10 oldest and youngest names for each sex
5. Plot trends for top 5 oldest and youngest names
"Oldest" = lowest median year (popular longest ago)
"Youngest" = highest median year (most recently popular)
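Steps 1 and 2 can be sketched on toy data as follows. This is one possible approach using a grouped `transform`, not necessarily the intended solution; the DataFrame `toy`, its counts, and the lowered threshold of 50 are assumptions chosen so the example stays small.

```python
import pandas as pd

# Toy data: made-up counts, not real SSA values.
toy = pd.DataFrame({
    "name": ["Ann", "Ann", "Zed", "Zed"],
    "sex": ["F", "F", "M", "M"],
    "count": [30, 25, 4, 5],
    "year": [1900, 1901, 1900, 1901],
})

threshold = 50  # the assignment uses 20,000

# Total count per (name, sex) pair, broadcast back onto every row.
pair_totals = toy.groupby(["name", "sex"])["count"].transform("sum")
kept = toy[pair_totals >= threshold]

# Number of (name, sex) pairs removed by the filter.
n_pairs_before = toy.groupby(["name", "sex"]).ngroups
n_pairs_after = kept.groupby(["name", "sex"]).ngroups
n_filtered_out = n_pairs_before - n_pairs_after

print(kept)
print(n_filtered_out)
```

Here the pair (Ann, F) totals 55 and is kept, while (Zed, M) totals 9 and is dropped, so one pair is filtered out.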
Written Question: Can birth rate influence which names appear oldest/youngest? Is there a more robust approach?
Part 5: Making Predictions (11 pts)
Predict ages using: age = current_year - median_birth_year
Predict ages for:
- Men: Gerald, Kai, Spencer, Jeffrey
- Women: Madison, Katherine, Anna, Simone
Written Questions:
- Do predictions match your experience meeting people with these names?
- What difference does median vs mean vs mode make?
Functions to Implement
def load_baby_names(data_dir: str) -> pd.DataFrame:
    """Load all SSA baby name files into a single DataFrame."""
def filter_by_count(df: pd.DataFrame, threshold: int = 20000) -> pd.DataFrame:
    """Filter to (name, sex) pairs with total count >= threshold."""
def calculate_median_year(df: pd.DataFrame, name: str, sex: str) -> int:
    """Calculate weighted median birth year for a name."""
def calculate_mean_year(df: pd.DataFrame, name: str, sex: str) -> float:
    """Calculate weighted mean birth year for a name."""
def get_oldest_names(df: pd.DataFrame, sex: str, n: int = 10) -> pd.DataFrame:
    """Get n oldest names (lowest median year) for a sex."""
def get_youngest_names(df: pd.DataFrame, sex: str, n: int = 10) -> pd.DataFrame:
    """Get n youngest names (highest median year) for a sex."""
def predict_age(df: pd.DataFrame, name: str, sex: str, current_year: int = 2024) -> int:
    """Predict age from median birth year."""
Hints
- Use glob.glob() to find all yob*.txt files
- Extract year from filename: yob1880.txt → 1880
- Use pd.concat() to combine DataFrames
- For weighted median: np.repeat(years, counts) then np.median()
- For weighted mean: sum(year * count) / sum(count)
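The weighted median and weighted mean hints can be illustrated on a tiny example. The arrays below are made-up counts for a single hypothetical name. The cumulative-sum variant is an assumption added here as a memory-friendly alternative; it returns the lower weighted median, which can differ from `np.median` by half a year when the two middle babies fall in different years.

```python
import numpy as np

# Toy data: made-up counts for one name.
years = np.array([1990, 2000, 2010])
counts = np.array([100, 50, 10])

# Weighted median via expansion: one array entry per baby.
expanded = np.repeat(years, counts)
median_by_repeat = float(np.median(expanded))

# Same idea without expansion: first year at which the cumulative
# count reaches half of the total (the lower weighted median).
cumulative = np.cumsum(counts)
idx = np.searchsorted(cumulative, counts.sum() / 2)
median_by_cumsum = int(years[idx])

# Weighted mean: sum(year * count) / sum(count).
weighted_mean = float((years * counts).sum() / counts.sum())

print(median_by_repeat, median_by_cumsum, weighted_mean)
```

With these numbers both medians land on 1990, while the weighted mean is 1994.375 because the later years pull it upward.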
Grading
| Part | Points |
|---|---|
| Style & Flow | 15 |
| Part 0: Load Data | 5 |
| Part 1: Sanity Checks | 3 |
| Part 2: Patterns | 10 |
| Part 3: Ruth | 6 |
| Part 4: Oldest/Youngest | 50 |
| Part 5: Predictions | 11 |
| Total | 100 |
Extra Credit (up to 5 pts)
Propose and demonstrate a better metric for predicting age from names.
"""
COMP 341 Homework 1: Exploring Baby Names

Analyze SSA baby name data to explore naming trends and predict ages.

Functions to implement:
- load_baby_names: Load all yearly files into a single DataFrame
- filter_by_count: Filter to names with >= threshold total occurrences
- calculate_median_year: Weighted median birth year for a name
- calculate_mean_year: Weighted mean birth year for a name
- get_oldest_names: Names popular longest ago (lowest median year)
- get_youngest_names: Names most recently popular (highest median year)
- predict_age: Predict age based on median birth year
"""

import pandas as pd
import numpy as np
import glob
from pathlib import Path
from typing import Union


def load_baby_names(data_dir: Union[str, Path]) -> pd.DataFrame:
    """Load all SSA baby name files into a single DataFrame.

    The SSA provides yearly files named yobYYYY.txt with format:
        name,sex,count (no header)

    Args:
        data_dir: Path to directory containing yobYYYY.txt files

    Returns:
        DataFrame with columns: name, sex, count, year
        - name: Baby name (str)
        - sex: 'M' or 'F' (str)
        - count: Number of babies with this name that year (int)
        - year: Year of birth (int)
    """
    # TODO: Implement this function
    # Hint: Use glob to find all yob*.txt files
    # Hint: Extract year from filename (e.g., yob1880.txt -> 1880)
    # Hint: pd.concat() to combine all DataFrames
    pass


def filter_by_count(df: pd.DataFrame, threshold: int = 20000) -> pd.DataFrame:
    """Filter to (name, sex) pairs with total count >= threshold.

    This removes uncommon names that don't have enough data for
    reliable statistical analysis.

    Args:
        df: Baby names DataFrame with columns [name, sex, count, year]
        threshold: Minimum total occurrences across all years

    Returns:
        Filtered DataFrame containing only rows for (name, sex) pairs
        that have at least `threshold` total occurrences
    """
    # TODO: Implement this function
    # Hint: Group by (name, sex), sum counts, filter, then merge back
    pass


def calculate_median_year(df: pd.DataFrame, name: str, sex: str) -> int:
    """Calculate the weighted median birth year for a specific name.

    The median is weighted by count - each baby counts as one observation.
    For example, if 100 babies named "Alice" were born in 1990 and
    50 in 2000, the median would be closer to 1990.

    Args:
        df: Baby names DataFrame
        name: Name to analyze
        sex: Sex ('M' or 'F')

    Returns:
        Median birth year as integer (rounded)
    """
    # TODO: Implement this function
    # Hint: np.repeat() can expand years by their counts
    # Hint: np.median() on the expanded array
    pass


def calculate_mean_year(df: pd.DataFrame, name: str, sex: str) -> float:
    """Calculate the weighted mean birth year for a specific name.

    Args:
        df: Baby names DataFrame
        name: Name to analyze
        sex: Sex ('M' or 'F')

    Returns:
        Mean birth year as float
    """
    # TODO: Implement this function
    # Hint: Weighted mean = sum(year * count) / sum(count)
    pass


def calculate_name_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Calculate median year, mean year, and total count for all names.

    Args:
        df: Baby names DataFrame (ideally already filtered)

    Returns:
        DataFrame with columns: name, sex, median_year, mean_year, total_count
        Sorted by median_year (ascending), then mean_year, then total_count
    """
    # TODO: Implement this function
    # This combines the above calculations for all (name, sex) pairs
    pass


def get_oldest_names(df: pd.DataFrame, sex: str, n: int = 10) -> pd.DataFrame:
    """Get the n oldest names for a given sex.

    "Oldest" means lowest median year - names that were popular longest ago.

    Args:
        df: Baby names DataFrame
        sex: 'M' or 'F'
        n: Number of names to return

    Returns:
        DataFrame with top n oldest names, including median_year
    """
    # TODO: Implement this function
    pass


def get_youngest_names(df: pd.DataFrame, sex: str, n: int = 10) -> pd.DataFrame:
    """Get the n youngest names for a given sex.

    "Youngest" means highest median year - names most recently popular.

    Args:
        df: Baby names DataFrame
        sex: 'M' or 'F'
        n: Number of names to return

    Returns:
        DataFrame with top n youngest names, including median_year
    """
    # TODO: Implement this function
    pass


def predict_age(df: pd.DataFrame, name: str, sex: str,
                current_year: int = 2024) -> int:
    """Predict a person's age based on their name's median birth year.

    This is a simple heuristic: age = current_year - median_birth_year

    Args:
        df: Baby names DataFrame
        name: Person's name
        sex: 'M' or 'F'
        current_year: Year to calculate age from (default 2024)

    Returns:
        Predicted age as integer
    """
    # TODO: Implement this function
    pass


# =============================================================================
# Helper functions for visualization (optional to implement)
# =============================================================================

def get_births_per_year(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate total births per year by sex.

    Args:
        df: Baby names DataFrame

    Returns:
        DataFrame with columns: year, sex, total_births
    """
    # TODO: Implement if needed for plotting
    pass


def get_name_trend(df: pd.DataFrame, name: str, sex: str) -> pd.DataFrame:
    """Get yearly birth counts for a specific name.

    Args:
        df: Baby names DataFrame
        name: Name to track
        sex: Sex ('M' or 'F')

    Returns:
        DataFrame with columns: year, count
    """
    # TODO: Implement if needed for plotting
    pass
"""
COMP 341 Homework 1: Exploring Baby Names

Analyze SSA baby name data to explore naming trends and predict ages.

Functions to implement:
- load_baby_names: Load all yearly files into a single DataFrame
- filter_by_count: Filter to names with >= threshold total occurrences
- calculate_median_year: Weighted median birth year for a name
- calculate_mean_year: Weighted mean birth year for a name
- get_oldest_names: Names popular longest ago (lowest median year)
- get_youngest_names: Names most recently popular (highest median year)
- predict_age: Predict age based on median birth year
"""

import pandas as pd
import numpy as np
import glob
from pathlib import Path
from typing import Union


def load_baby_names(data_dir: Union[str, Path]) -> pd.DataFrame:
    """Load all SSA baby name files into a single DataFrame.

    The SSA provides yearly files named yobYYYY.txt with format:
        name,sex,count (no header)

    Args:
        data_dir: Path to directory containing yobYYYY.txt files

    Returns:
        DataFrame with columns: name, sex, count, year
        - name: Baby name (str)
        - sex: 'M' or 'F' (str)
        - count: Number of babies with this name that year (int)
        - year: Year of birth (int)
    """
    # glob returns paths in arbitrary order; sort so rows are ordered by year.
    files = sorted(glob.glob(str(Path(data_dir) / "yob*.txt")))
    dfs = []
    for f in files:
        year = int(Path(f).stem[3:])  # "yob1880" -> 1880
        df = pd.read_csv(f, header=None, names=["name", "sex", "count"])
        df["year"] = year
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)
def filter_by_count(df: pd.DataFrame, threshold: int = 20000) -> pd.DataFrame:
    """Filter to (name, sex) pairs with total count >= threshold.

    This removes uncommon names that don't have enough data for
    reliable statistical analysis.

    Args:
        df: Baby names DataFrame with columns [name, sex, count, year]
        threshold: Minimum total occurrences across all years

    Returns:
        Filtered DataFrame containing only rows for (name, sex) pairs
        that have at least `threshold` total occurrences
    """
    totals = df.groupby(["name", "sex"])["count"].sum()
    valid = totals[totals >= threshold].index
    mask = pd.MultiIndex.from_arrays([df["name"], df["sex"]]).isin(valid)
    return df[mask].reset_index(drop=True)
def calculate_median_year(df: pd.DataFrame, name: str, sex: str) -> int:
    """Calculate the weighted median birth year for a specific name.

    The median is weighted by count - each baby counts as one observation.
    For example, if 100 babies named "Alice" were born in 1990 and
    50 in 2000, the median would be closer to 1990.

    Args:
        df: Baby names DataFrame
        name: Name to analyze
        sex: Sex ('M' or 'F')

    Returns:
        Median birth year as integer (rounded)
    """
    subset = df[(df["name"] == name) & (df["sex"] == sex)]
    # Expand to one entry per baby, then take the ordinary median.
    years = np.repeat(subset["year"].values, subset["count"].values)
    # Note: int() truncates, so a median of X.5 maps to year X.
    return int(np.median(years))


def calculate_mean_year(df: pd.DataFrame, name: str, sex: str) -> float:
    """Calculate the weighted mean birth year for a specific name.

    Args:
        df: Baby names DataFrame
        name: Name to analyze
        sex: Sex ('M' or 'F')

    Returns:
        Mean birth year as float
    """
    subset = df[(df["name"] == name) & (df["sex"] == sex)]
    return float((subset["year"] * subset["count"]).sum() / subset["count"].sum())
def calculate_name_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Calculate median year, mean year, and total count for all names.

    Args:
        df: Baby names DataFrame (ideally already filtered)

    Returns:
        DataFrame with columns: name, sex, median_year, mean_year, total_count
        Sorted by median_year (ascending), then mean_year, then total_count
    """
    def weighted_median(group):
        years = np.repeat(group["year"].values, group["count"].values)
        return int(np.median(years))

    def weighted_mean(group):
        return float((group["year"] * group["count"]).sum() / group["count"].sum())

    grouped = df.groupby(["name", "sex"])
    stats = pd.DataFrame({
        "median_year": grouped.apply(weighted_median),
        "mean_year": grouped.apply(weighted_mean),
        "total_count": grouped["count"].sum(),
    }).reset_index()
    return stats.sort_values(
        ["median_year", "mean_year", "total_count"]
    ).reset_index(drop=True)
def get_oldest_names(df: pd.DataFrame, sex: str, n: int = 10) -> pd.DataFrame:
    """Get the n oldest names for a given sex.

    "Oldest" means lowest median year - names that were popular longest ago.

    Args:
        df: Baby names DataFrame
        sex: 'M' or 'F'
        n: Number of names to return

    Returns:
        DataFrame with top n oldest names, including median_year
    """
    filtered = filter_by_count(df)
    stats = calculate_name_stats(filtered)
    sex_stats = stats[stats["sex"] == sex]
    return sex_stats.nsmallest(n, "median_year").reset_index(drop=True)


def get_youngest_names(df: pd.DataFrame, sex: str, n: int = 10) -> pd.DataFrame:
    """Get the n youngest names for a given sex.

    "Youngest" means highest median year - names most recently popular.

    Args:
        df: Baby names DataFrame
        sex: 'M' or 'F'
        n: Number of names to return

    Returns:
        DataFrame with top n youngest names, including median_year
    """
    filtered = filter_by_count(df)
    stats = calculate_name_stats(filtered)
    sex_stats = stats[stats["sex"] == sex]
    return sex_stats.nlargest(n, "median_year").reset_index(drop=True)
def predict_age(df: pd.DataFrame, name: str, sex: str,
                current_year: int = 2024) -> int:
    """Predict a person's age based on their name's median birth year.

    This is a simple heuristic: age = current_year - median_birth_year

    Args:
        df: Baby names DataFrame
        name: Person's name
        sex: 'M' or 'F'
        current_year: Year to calculate age from (default 2024)

    Returns:
        Predicted age as integer
    """
    median_year = calculate_median_year(df, name, sex)
    return current_year - median_year


# =============================================================================
# Helper functions for visualization (optional to implement)
# =============================================================================

def get_births_per_year(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate total births per year by sex.

    Args:
        df: Baby names DataFrame

    Returns:
        DataFrame with columns: year, sex, total_births
    """
    result = df.groupby(["year", "sex"])["count"].sum().reset_index()
    result.columns = ["year", "sex", "total_births"]
    return result


def get_name_trend(df: pd.DataFrame, name: str, sex: str) -> pd.DataFrame:
    """Get yearly birth counts for a specific name.

    Args:
        df: Baby names DataFrame
        name: Name to track
        sex: Sex ('M' or 'F')

    Returns:
        DataFrame with columns: year, count
    """
    subset = df[(df["name"] == name) & (df["sex"] == sex)][["year", "count"]]
    return subset.sort_values("year").reset_index(drop=True)
COMP 341 Homework 1: Exploring Baby Names
Written Questions
Complete each question below. Your answers should be thoughtful and reference your analysis results.
---
Part 2: Search for General Patterns
Question 1
> Do you notice any interesting patterns across years? Do they relate to historical events?
Your Answer:
The total births per year show several striking patterns. There is a sharp increase in births starting around 1946 and peaking in the late 1950s—this is the well-known "Baby Boom" following World War II, when returning soldiers started families. A visible dip occurs around 1918-1919, likely reflecting the casualties and disruptions of WWI and the 1918 influenza pandemic. Another dip appears during the early 1930s, corresponding to the Great Depression, when economic hardship reduced family sizes. More recently, there is a decline in births after 2007, coinciding with the 2008 financial crisis. Female births consistently track closely with male births throughout history, though male births are slightly higher across most years.
---
Part 3: Disentangling the Sexes
Question 2
> Do you think the famous baseball player, Babe Ruth (born 1895, died 1948) had any influence on boys named Ruth? Comment on this.
Your Answer:
Looking at the trend of boys named Ruth, there is a notable spike in the early 1900s through the 1920s, which aligns closely with Babe Ruth's rise to fame (he became a national celebrity in the late 1910s and dominated the 1920s). It seems plausible that Babe Ruth's celebrity status influenced parents to name their sons Ruth, as naming children after famous athletes was a cultural practice of that era. However, "Ruth" was already somewhat in use for boys before Ruth's career peaked, so he likely amplified an existing (though rare) trend rather than creating it entirely. After Babe Ruth's era, the name Ruth for boys essentially disappears, suggesting his influence was a key driver of what was always a small but notable phenomenon.
---
Part 4: Oldest and Youngest Names
Question 3
> Recall the plot we made earlier showing total births over time. Do you think that the birth rate can influence the oldest and youngest names we identified? If so, is there a more robust way to choose the "oldest" and "youngest" names?
Your Answer:
Yes, birth rate absolutely influences the oldest and youngest names. During the Baby Boom (1946–1964), there were far more births than in earlier or later decades. This means names that were popular during that era will have enormous raw counts pulling their weighted median toward those years—even if those names were also used in earlier eras. Conversely, names that were common in the 1880s–1920s had far fewer births recorded (both because the population was smaller and because SSA records are less complete for early years), so their weighted medians skew toward more recent decades than their true peak popularity.
A more robust approach would be to normalize each year's counts by the total births that year before computing statistics. This converts raw counts to proportions (i.e., what fraction of all babies in a given year had that name), which removes the confounding effect of overall birth rate fluctuations. Then the weighted median of proportions—or perhaps the year of peak proportion—would better reflect when a name was truly most fashionable relative to the population.
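The normalization described above can be sketched on toy data. Everything here is an assumption for illustration: the names, the counts, and the column name `share` are invented, and the share-weighted median uses a cumulative-sum rule (first year at which the running total reaches half), which is one of several reasonable definitions.

```python
import numpy as np
import pandas as pd

# Toy data: made-up counts. Total births grow sharply by 2000.
toy = pd.DataFrame({
    "name":  ["Ada", "Bea", "Ada", "Bea", "Ada", "Bea"],
    "sex":   ["F"] * 6,
    "year":  [1900, 1900, 1950, 1950, 2000, 2000],
    "count": [50, 50, 60, 140, 100, 900],
})

# Share of each year's births that went to each name.
toy["share"] = toy["count"] / toy.groupby("year")["count"].transform("sum")

ada = toy[toy["name"] == "Ada"].sort_values("year")
ada_years = ada["year"].to_numpy()

# Year of peak share versus year of peak raw count.
peak_share_year = int(ada.loc[ada["share"].idxmax(), "year"])
peak_count_year = int(ada.loc[ada["count"].idxmax(), "year"])


def lower_weighted_median(values: np.ndarray, weights: np.ndarray) -> int:
    """First value at which the cumulative weight reaches half the total."""
    cumulative = np.cumsum(weights)
    return int(values[np.searchsorted(cumulative, cumulative[-1] / 2)])


raw_median_year = lower_weighted_median(ada_years, ada["count"].to_numpy())
share_median_year = lower_weighted_median(ada_years, ada["share"].to_numpy())

print(peak_count_year, peak_share_year, raw_median_year, share_median_year)
```

In this toy case the raw counts for "Ada" peak in 2000 and give a median of 1950, while the shares peak in 1900 and give a median of 1900, because the name's fraction of all births fell even as its raw count rose.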
---
Part 5: Making Predictions
Question 4
> Do the predicted ages match what you have experienced in meeting people with these names? Would you rely on our predictions for guessing the age of people? Why or why not?
Your Answer:
The predictions are broadly plausible. For example, Gerald and Jeffrey have high predicted ages (60s–70s), which matches real-world experience—those names are strongly associated with older generations. Kai and Madison have much lower predicted ages (teens–30s), consistent with them being modern names that became popular in recent decades. Simone also aligns with a relatively younger prediction given its recent surge in popularity.
That said, I would not rely heavily on these predictions for individuals. Name popularity distributions are often multimodal—some names have had multiple waves of popularity across different generations—so the median year can fall in a trough between peaks and misrepresent the actual distribution. Additionally, these predictions reflect cohort trends, not individual identity. Someone named Katherine could be anywhere from 5 to 90 years old. The predictions are useful as statistical tendencies across large populations but are too noisy to apply to a single person.
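The multimodal concern can be made concrete with a toy distribution. The counts below are invented for a hypothetical name with two waves of popularity; they are not taken from the SSA data.

```python
import numpy as np

# Toy data: a made-up name with two waves of popularity.
years = np.arange(1900, 2021, 10)  # 1900, 1910, ..., 2020
counts = np.array([5, 40, 45, 10, 1, 1, 1, 1, 1, 10, 45, 40, 5])

# Weighted median via expansion, as in the assignment hints.
median_year = int(np.median(np.repeat(years, counts)))
predicted_age = 2024 - median_year

# How many babies were actually born in the median year?
count_at_median = int(counts[years == median_year][0])
peak_years = years[counts == counts.max()].tolist()

print(median_year, predicted_age, count_at_median, peak_years)
```

The median lands on 1960, giving a predicted age of 64, yet only 1 of the 205 toy babies was born that year; the peaks sit at 1920 and 2000, so the prediction describes almost nobody with the name.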
---
Question 5
> Based on your understanding of the data and means versus medians, what difference does choosing "old" and "young" names using median year or mean year make? What if you chose based on mode (the total births) or another metric? How do you think these choices would affect the results?
Your Answer:
Median year and mean year often agree for names with roughly symmetric popularity distributions, but they diverge for names with skewed histories. A name like "Mary" that was extremely popular in the early 1900s and has declined steadily will have its weighted mean pulled toward those early peak years more strongly than its median, since the mean is sensitive to the magnitude of counts at each year. The median is more robust to such outliers and reflects the "typical" birth year.
Using mode (the year with the peak total birth count for a given name) would give a different ranking still—it identifies when a name was at its absolute height of popularity, ignoring the overall shape of the distribution. A name could have a high mode count in a recent year due to a brief but intense fashion trend, even if most people with that name were born decades earlier.
A more nuanced metric might be the interquartile range or the year at which cumulative births cross 50%—essentially a different formulation of the median. Using proportional popularity (counts normalized by total births per year) combined with median would be more robust. Each metric captures a different aspect of "when a name was popular," so the choice should depend on the question being asked: absolute popularity (mode), central tendency (mean/median), or relative trendiness (normalized proportional measures).
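To see how the three statistics can disagree, here is a small sketch on invented counts for a hypothetical name with an early peak and a long, thin tail of later births. "Mode" is taken here to mean the year with the highest count, matching the interpretation used in the answer above.

```python
import numpy as np

# Toy data: made-up counts with an early peak and a long later tail.
years = np.array([1900, 1910, 1920, 1930, 1980, 2020])
counts = np.array([100, 60, 30, 10, 5, 5])

# Median: middle baby after expanding to one entry per birth.
median_year = float(np.median(np.repeat(years, counts)))

# Mean: sum(year * count) / sum(count).
mean_year = float((years * counts).sum() / counts.sum())

# Mode: the single year with the most births.
mode_year = int(years[np.argmax(counts)])

print(mode_year, median_year, round(mean_year, 2))
```

On this toy data the mode is 1900, the median is 1910, and the mean is about 1911.9: the handful of births in 1980 and 2020 shift the mean but leave the median and mode unchanged. Ranking names by each statistic could therefore order them differently.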
COMP 341 Homework 1: Baby Names — Written Questions Grading
---
Question 1: Birth Patterns and Historical Events (5 pts)
Score: 5/5
Excellent response. The student identifies multiple specific patterns with strong historical connections:
- Baby Boom (~1946, peaking late 1950s) tied to WWII soldiers returning
- Dip around 1918-1919 tied to WWI and the 1918 influenza pandemic
- Dip during early 1930s tied to the Great Depression
- Decline after 2007 tied to the 2008 financial crisis
- Male births consistently slightly higher than female
This exceeds the rubric requirement of "at least 2 specific patterns with plausible historical connections." The student identifies 4 distinct patterns, each with a concrete historical explanation. The observation about the 1918 flu pandemic goes beyond the reference solution's expected insights.
---
Question 2: Babe Ruth's Influence (5 pts)
Score: 5/5
Strong, nuanced answer that connects data to Babe Ruth's timeline:
- Notes the spike in the early 1900s through the 1920s aligning with Ruth's rise to fame
- Correctly places Ruth's peak celebrity in the late 1910s–1920s
- Provides a balanced take: Ruth "amplified an existing (though rare) trend rather than creating it entirely"
- Notes the name essentially disappears for boys after Ruth's era
The answer demonstrates data-driven reasoning and avoids oversimplification. The student doesn't cite a specific peak year with an exact count (e.g., "peaked in 1920 with X boys"), but the temporal specificity and causal reasoning are strong enough for full credit.
---
Question 3: Birth Rate Bias and Alternatives (5 pts)
Score: 5/5
Excellent response that clearly addresses both required components:
Bias explanation: The student explains how the Baby Boom's higher raw counts pull weighted medians toward those years, and how earlier eras with fewer recorded births cause their medians to skew. They also note the confound of incomplete early SSA records — a thoughtful addition.
Alternative metric: Proposes normalizing counts by total births per year (proportions), then computing weighted median of proportions or year of peak proportion. This directly matches the rubric's expected "normalize by total births per year" approach.
Both components are addressed with specificity and clarity.
---
Question 4: Prediction Reliability (5 pts)
Score: 5/5
The student discusses multiple well-articulated limitations:
1. Multimodal distributions: names can have multiple waves of popularity, so the median can fall in a trough between peaks
2. Cohort vs. individual: predictions reflect population-level tendencies, not individual identity ("Katherine could be anywhere from 5 to 90")
3. Statistical noise: predictions are "too noisy to apply to a single person"
The student also provides concrete examples from their predictions (Gerald/Jeffrey → 60s–70s, Kai/Madison → teens–30s) demonstrating engagement with the actual results. This meets the rubric's requirement of "at least 2 limitations with specific reasoning."
---
Question 5: Median vs Mean vs Mode (5 pts)
Score: 5/5
Comprehensive answer covering all three metrics with concrete implications:
- Median vs Mean: Correctly explains that mean is sensitive to magnitude of counts (outliers/skew), while median is more robust. Uses "Mary" as a concrete example of how a steadily declining name would show divergence.
- Mode: Correctly identifies it as the year of peak popularity, notes it could highlight brief but intense trends, and explains how it differs from median/mean.
- Effects on results: Explicitly states different metrics would produce different rankings and explains why (brief trends vs. central tendency vs. relative popularity).
- Goes further: Proposes interquartile range and normalized proportional measures as more nuanced alternatives, and notes the choice should depend on the question being asked.
This exceeds the rubric requirement of "explains differences between median/mean/mode with examples of how they'd affect results."
---
Grading Summary
| Question | Topic | Points Earned | Max Points |
|---|---|---|---|
| Q1 | Birth patterns and historical events | 5 | 5 |
| Q2 | Babe Ruth's influence | 5 | 5 |
| Q3 | Birth rate bias and alternatives | 5 | 5 |
| Q4 | Prediction reliability | 5 | 5 |
| Q5 | Median vs mean vs mode | 5 | 5 |
| Total | Written | 25 | 25 |
Overall Comments: This is an outstanding submission. Every answer is specific, data-driven, and demonstrates genuine understanding of the material. The student consistently goes beyond the minimum requirements — citing concrete examples, offering nuanced interpretations, and proposing thoughtful alternatives. The writing is clear and well-organized throughout.
Total: 25/25
diff --git a/.mcp.json b/.mcp.json
deleted file mode 100644
index 1109ee4..0000000
--- a/.mcp.json
+++ /dev/null
@@ -1,11 +0,0 @@
-{
- "mcpServers": {
- "bscs-bench": {
- "command": "/Users/bebe/Code/bscs-bench/bin/bscs-mcp",
- "args": [
- "--workspace",
- "./writeup.md b/writeup.md
index 77374b4..3c9a0ce 100644
--- a/writeup.md
+++ b/writeup.md
@@ -14,7 +14,7 @@ Complete each question below. Your answers should be thoughtful and reference yo
**Your Answer:**
-[Write your answer here]
+The total births per year show several striking patterns. There is a sharp increase in births starting around 1946 and peaking in the late 1950s—this is the well-known "Baby Boom" following World War II, when returning soldiers started families. A visible dip occurs around 1918-1919, likely reflecting the casualties and disruptions of WWI and the 1918 influenza pandemic. Another dip appears during the early 1930s, corresponding to the Great Depression, when economic hardship reduced family sizes. More recently, there is a decline in births after 2007, coinciding with the 2008 financial crisis. Female births consistently track closely with male births throughout history, though male births are slightly higher across most years.
---
@@ -22,11 +22,11 @@ Complete each question below. Your answers should be thoughtful and reference yo
### Question 2
-> Do you think the famous baseball player, Babe Ruth, had any influence on boys named Ruth? Comment on this.
+> Do you think the famous baseball player, Babe Ruth (born 1895, died 1948) had any influence on boys named Ruth? Comment on this.
**Your Answer:**
-[Write your answer here]
+Looking at the trend of boys named Ruth, there is a notable spike in the early 1900s through the 1920s, which aligns closely with Babe Ruth's rise to fame (he became a national celebrity in the late 1910s and dominated the 1920s). It seems plausible that Babe Ruth's celebrity status influenced parents to name their sons Ruth, as naming children after famous athletes was a cultural practice of that era. However, "Ruth" was already somewhat in use for boys before Ruth's career peaked, so he likely amplified an existing (though rare) trend rather than creating it entirely. After Babe Ruth's era, the name Ruth for boys essentially disappears, suggesting his influence was a key driver of what was always a small but notable phenomenon.
---
@@ -38,7 +38,9 @@ Complete each question below. Your answers should be thoughtful and reference yo
**Your Answer:**
-[Write your answer here]
+Yes, birth rate absolutely influences the oldest and youngest names. During the Baby Boom (1946–1964), there were far more births than in earlier or later decades. This means names that were popular during that era will have enormous raw counts pulling their weighted median toward those years—even if those names were also used in earlier eras. Conversely, names that were common in the 1880s–1920s had far fewer births recorded (both because the population was smaller and because SSA records are less complete for early years), so their weighted medians skew toward more recent decades than their true peak popularity.
+
+A more robust approach would be to normalize each year's counts by the total births that year before computing statistics. This converts raw counts to proportions (i.e., what fraction of all babies in a given year had that name), which removes the confounding effect of overall birth rate fluctuations. Then the weighted median of proportions—or perhaps the year of peak proportion—would better reflect when a name was truly most fashionable relative to the population.
---
@@ -50,7 +52,9 @@ Complete each question below. Your answers should be thoughtful and reference yo
**Your Answer:**
-[Write your answer here]
+The predictions are broadly plausible. For example, Gerald and Jeffrey have high predicted ages (60s–70s), which matches real-world experience—those names are strongly associated with older generations. Kai and Madison have much lower predicted ages (teens–30s), consistent with them being modern names that became popular in recent decades. Simone also aligns with a relatively younger prediction given its recent surge in popularity.
+
+That said, I would not rely heavily on these predictions for individuals. Name popularity distributions are often multimodal—some names have had multiple waves of popularity across different generations—so the median year can fall in a trough between peaks and misrepresent the actual distribution. Additionally, these predictions reflect cohort trends, not individual identity. Someone named Katherine could be anywhere from 5 to 90 years old. The predictions are useful as statistical tendencies across large populations but are too noisy to apply to a single person.
---
@@ -60,4 +64,8 @@ Complete each question below. Your answers should be thoughtful and reference yo
**Your Answer:**
-[Write your answer here]
+Median year and mean year often agree for names with roughly symmetric popularity distributions, but they diverge for names with skewed histories. A name like "Mary" that was extremely popular in the early 1900s and has declined steadily will have its weighted mean pulled toward those early peak years more strongly than its median, since the mean is sensitive to the magnitude of counts at each year. The median is more robust to such outliers and reflects the "typical" birth year.
+
+Using mode (the year with the peak total birth count for a given name) would give a different ranking still—it identifies when a name was at its absolute height of popularity, ignoring the overall shape of the distribution. A name could have a high mode count in a recent year due to a brief but intense fashion trend, even if most people with that name were born decades earlier.
+
+A more nuanced metric might be the interquartile range or the year at which cumulative births cross 50%—essentially a different formulation of the median. Using proportional popularity (counts normalized by total births per year) combined with median would be more robust. Each metric captures a different aspect of "when a name was popular," so the choice should depend on the question being asked: absolute popularity (mode), central tendency (mean/median), or relative trendiness (normalized proportional measures).
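The normalization proposed in the Question 3 answer (dividing each count by that year's total births) can be sketched as follows. This is an illustration under assumptions, not the submission's code: the toy rows are made up, the column names follow the homework's `name`, `sex`, `count`, `year` layout, and `year_total`, `prop`, and `peak_year` are hypothetical names introduced here.

```python
import pandas as pd

# Toy frame in the homework's column layout; the numbers are made up.
names = pd.DataFrame({
    "name":  ["Ann", "Bob", "Ann", "Bob"],
    "sex":   ["F", "M", "F", "M"],
    "count": [10, 30, 100, 100],
    "year":  [1900, 1900, 1950, 1950],
})

# Total recorded births per year, broadcast back onto every row of that year.
names["year_total"] = names.groupby("year")["count"].transform("sum")

# Share of that year's recorded babies who received this (name, sex).
names["prop"] = names["count"] / names["year_total"]

# Year of peak share for each (name, sex) pair.
peak_rows = names.loc[names.groupby(["name", "sex"])["prop"].idxmax()]
peak_year = peak_rows.set_index(["name", "sex"])["year"]
print(peak_year)
```

In this toy data "Bob" has its largest raw count in 1950 but its largest share in 1900, which is the kind of birth-rate effect the answer describes.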
Sub-Model Usage
| Model | Input | Output | Cache Read | Cost |
|---|---|---|---|---|
| claude-sonnet-4-6 | 13 | 5.7K | 427.9K | $0.54 |
BSCS Bench