Agent Work: Movie Critics Analysis
Claude Opus 4.6 · COMP 341: Practical Machine Learning
Homework 2: Movie Critics
Overview
Find critics whose ratings best predict audience scores. You'll clean messy review data, handle missing values with different imputation methods, and use correlation analysis to identify the most predictive critics.
Data Files:
- rotten_tomatoes_critic_reviews.csv - 1.1M reviews from critics (226 MB)
- rotten_tomatoes_movies.csv - 17,712 movies with audience ratings
- Location: data/hw2/ (mounted at /data/hw2 in Docker)
Key Columns:
- Critics: critic_name, review_score, review_type (Fresh/Rotten), rotten_tomatoes_link
- Movies: rotten_tomatoes_link, audience_rating, movie_title
Tasks
Part 0: Getting to Know the Data (11 pts)
1. Load both CSV files into DataFrames
2. Remove critics with no name (empty or NaN critic_name)
3. Determine the number of unique critics
4. Plot the distribution of movies reviewed per critic
Written Question: What do you notice about the distribution? Are most critics prolific or occasional reviewers?
Part 1: Reviewer Bias (9 pts)
Analyze whether critics tend toward positive or negative reviews:
1. Calculate fresh vs rotten review percentages per critic
2. Visualize the distribution of critic bias
Written Question: Do critics tend to be harsh or generous? Are there outliers?
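The two steps above reduce to a single groupby; a minimal sketch on toy data (the column names follow the dataset, and the tiny `reviews` frame is a stand-in for the cleaned critic DataFrame):

```python
import pandas as pd

# Toy stand-in for the cleaned critic reviews DataFrame.
reviews = pd.DataFrame({
    "critic_name": ["A", "A", "A", "B", "B"],
    "review_type": ["Fresh", "Fresh", "Rotten", "Rotten", "Rotten"],
})

# Fresh percentage per critic: share of reviews labelled "Fresh".
fresh_pct = (
    reviews.groupby("critic_name")["review_type"]
    .apply(lambda s: (s == "Fresh").mean() * 100)
)
print(fresh_pct)
# The distribution of critic bias can then be plotted with fresh_pct.hist()
```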
Part 2: Cleaning Scores (15 pts)
The review_score column is messy - critics use different rating scales:
- Fractions: 3/5, 3/4, 87/100, 4/10
- Letter grades: A, A-, B+, B, C, F
- Numbers: 75, 3.5
- Empty values
Normalize all scores to a 0-100 scale:
| Original | Normalized |
|---|---|
| 3/5 | 60 |
| 3/4 | 75 |
| 87/100 | 87 |
| A | 95 |
| B+ | 88 |
| C | 75 |
| F | 50 |
| Empty/NaN | NaN (preserve) |
Letter Grade Scale:
- A: 95, A-: 92
- B+: 88, B: 85, B-: 82
- C+: 78, C: 75, C-: 72
- D+: 68, D: 65, D-: 62
- F: 50
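The rules above collapse into one helper; a sketch (the name normalize_score is illustrative, the grade table is taken from the scale above, and malformed inputs such as a zero denominator would need guarding in a real implementation):

```python
import math

# Letter grade scale from the assignment table.
GRADES = {"A": 95, "A-": 92, "B+": 88, "B": 85, "B-": 82,
          "C+": 78, "C": 75, "C-": 72, "D+": 68, "D": 65, "D-": 62, "F": 50}

def normalize_score(raw):
    """Map a raw review_score to the 0-100 scale, preserving NaN."""
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return float("nan")
    s = str(raw).strip()
    if not s:
        return float("nan")
    if s.upper() in GRADES:                      # letter grades
        return float(GRADES[s.upper()])
    if "/" in s:                                 # fractions like "3/5"
        num, den = (float(p) for p in s.split("/", 1))
        # "92/10"-style entries are likely already on a 0-100 scale.
        return min(num, 100.0) if num > den else num / den * 100.0
    return min(float(s), 100.0)                  # plain numbers, capped at 100

print(normalize_score("3/5"))   # 60.0
print(normalize_score("B+"))    # 88.0
```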
Part 3: Handling Missing Values & Correlation (50 pts)
Most critics review only a small subset of movies, creating a sparse matrix. You'll try three imputation strategies:
Step 1: Create Critic-Movie Matrix
Build a pivot table where:
- Rows = movies
- Columns = critics (their normalized scores)
- Include audience_score from the movies table
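A sketch of the pivot-plus-join on toy data (column names match the dataset; `aggfunc="mean"` covers the case where a critic scored the same movie twice):

```python
import pandas as pd

critics = pd.DataFrame({
    "rotten_tomatoes_link": ["m1", "m1", "m2"],
    "critic_name": ["A", "B", "A"],
    "review_score": [60.0, 88.0, 75.0],
})
movies = pd.DataFrame({
    "rotten_tomatoes_link": ["m1", "m2"],
    "audience_rating": [70.0, 80.0],
})

# Rows = movies, columns = critics, values = normalized scores.
matrix = critics.pivot_table(index="rotten_tomatoes_link",
                             columns="critic_name",
                             values="review_score",
                             aggfunc="mean")
# Attach the audience score under the name the rest of the pipeline expects.
matrix = matrix.join(movies.set_index("rotten_tomatoes_link")["audience_rating"]
                     .rename("audience_score"))
print(matrix)
```

Movies critic B never reviewed stay NaN in the B column, which is exactly the sparsity the imputation step addresses.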
Step 2: Filter to Active Critics
Keep only critics who have scored at least 500 movies. This reduces sparsity and focuses on prolific reviewers.
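Because the filter counts scored reviews, not total reviews, a notna-aware count is the natural sketch (threshold lowered to 2 so the toy example stays small):

```python
import pandas as pd

df = pd.DataFrame({
    "critic_name": ["A", "A", "A", "B"],
    "review_score": [60.0, None, 75.0, 88.0],
})

scored = df.groupby("critic_name")["review_score"].count()  # count() skips NaN
keep = scored[scored >= 2].index                            # threshold = 2 for the toy data
filtered = df[df["critic_name"].isin(keep)]
print(filtered)
```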
Step 3: Three Imputation Methods
Method A - Zeros: Fill missing scores with 0
- Simple but biased (treats "not reviewed" as "terrible")
Method B - Mean: Fill with each critic's mean score
- Assumes critic would give their average rating
Method C - KNN (k=5): Use similar movies to estimate missing scores
- Most sophisticated, uses patterns in the data
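The three methods differ only in what they substitute for NaN; on a toy column the contrast is visible directly (a sketch; KNNImputer is scikit-learn's, as the hints note):

```python
import pandas as pd
from sklearn.impute import KNNImputer

col = pd.DataFrame({"x": [1.0, 2.0, None, 4.0], "y": [1.0, 2.0, 3.0, 4.0]})

zeros = col.fillna(0)                         # Method A: NaN -> 0
means = col.fillna(col.mean())                # Method B: NaN -> column mean
knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(col),
                   columns=col.columns)       # Method C: NaN -> neighbor average

print(zeros["x"].tolist(), means["x"].tolist(), knn["x"].tolist())
```

Here KNN fills the gap with the average of the two rows nearest in y, while the mean fill ignores row context entirely, which is the trade-off the written questions ask about.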
Step 4: Calculate Correlations
For each imputation method:
1. Calculate Pearson correlation between each critic and audience_score
2. Identify the top 5 most correlated critics
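With an imputed matrix in hand, the per-critic correlation is one corrwith call; a sketch on toy data:

```python
import pandas as pd

matrix = pd.DataFrame({
    "A": [60.0, 75.0, 90.0],
    "B": [90.0, 70.0, 50.0],
    "audience_score": [65.0, 72.0, 88.0],
})

# Correlate every critic column against the audience score.
critic_cols = matrix.columns.drop("audience_score")
corr = matrix[critic_cols].corrwith(matrix["audience_score"])
top = corr.nlargest(5)
print(top)
```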
Step 5: PCA Visualization
Run PCA on the imputed matrix to visualize the data in 2D.
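A sketch of the PCA step with scikit-learn (random data stands in for the imputed critic columns; audience_score should be dropped first so the projection reflects critic scores only):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))        # stand-in for the imputed critic columns

pca = PCA(n_components=2)
coords = pca.fit_transform(X)       # (20, 2): one 2-D point per movie
print(coords.shape, pca.explained_variance_ratio_)
```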
Written Questions:
- Which imputation method produces the highest correlations?
- Do the same critics appear in the top 5 across methods?
- What does the PCA plot reveal about critic clusters?
Part 4: Reflection (15 pts)
Written Questions:
1. If you were building a movie recommendation system, which critics would you weight most heavily?
2. What are the limitations of using correlation to identify "good" critics?
3. How might the 500-movie threshold bias your results?
Functions to Implement
# Data Loading
def load_critic_data(data_dir: str) -> pd.DataFrame:
"""Load rotten_tomatoes_critic_reviews.csv."""
def load_movie_data(data_dir: str) -> pd.DataFrame:
"""Load rotten_tomatoes_movies.csv."""
# Data Cleaning
def remove_unnamed_critics(df: pd.DataFrame) -> pd.DataFrame:
"""Remove critics with no name."""
def filter_critics_by_review_count(df: pd.DataFrame, threshold: int = 500) -> pd.DataFrame:
"""Keep critics with >= threshold scored movies."""
def clean_review_scores(df: pd.DataFrame) -> pd.DataFrame:
"""Normalize review_score to 0-100 scale."""
# Imputation
def prepare_critic_movie_matrix(critic_df, movie_df) -> pd.DataFrame:
"""Create movies x critics pivot table with audience_score."""
def impute_missing_zeros(df: pd.DataFrame) -> pd.DataFrame:
"""Fill NaN with 0."""
def impute_missing_mean(df: pd.DataFrame) -> pd.DataFrame:
"""Fill NaN with column mean."""
def impute_missing_knn(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
"""Fill NaN using KNN imputation."""
# Analysis
def calculate_critic_correlation(df: pd.DataFrame) -> pd.Series:
"""Pearson correlation between each critic and audience_score."""
def get_top_correlated_critics(correlations: pd.Series, n: int = 5) -> List[str]:
"""Top n critics by correlation."""
def run_pca(df: pd.DataFrame, n_components: int = 2) -> Tuple[np.ndarray, PCA]:
"""PCA on the imputed matrix."""
Hints
- Score cleaning: Use regex or string methods to detect fractions ('/' in score)
- Large data: The critic reviews file has 1.1M rows - operations may take a few seconds
- Sparse matrix: After filtering to 500+ reviews, you'll have ~50-100 critics
- KNN is slow: KNN imputation on sparse data can take 30+ seconds
- Correlation: Use df.corrwith(df['audience_score']) for efficient calculation
Grading
| Part | Points |
|---|---|
| Part 0: Data Exploration | 11 |
| Part 1: Reviewer Bias | 9 |
| Part 2: Score Cleaning | 15 |
| Part 3: Imputation & Correlation | 50 |
| Part 4: Written Reflection | 15 |
| Total | 100 |
Extra Credit (up to 5 pts)
Propose and implement an alternative approach to identify predictive critics. Ideas:
- Use only critics who reviewed the same movies as the test set
- Weight by critic expertise (top_critic flag)
- Time-weighted correlation (recent reviews matter more)
"""
COMP 341 Homework 2: Movie Critics
Find critics whose ratings best predict audience scores using data cleaning,
imputation, PCA, and correlation analysis.
Functions to implement:
- load_critic_data: Load critic reviews CSV
- load_movie_data: Load movies CSV
- remove_unnamed_critics: Filter out critics with no name
- filter_critics_by_review_count: Keep critics with >= threshold scored movies
- clean_review_scores: Normalize all scores to 0-100 scale
- prepare_critic_movie_matrix: Create pivot table of critics x movies
- impute_missing_zeros: Fill NaN with 0
- impute_missing_mean: Fill NaN with column mean
- impute_missing_knn: Fill NaN using KNN imputation
- calculate_critic_correlation: Pearson correlation with audience score
- get_top_correlated_critics: Top n critics by correlation
- run_pca: Principal Component Analysis
"""
import pandas as pd
import numpy as np
from pathlib import Path
from typing import Union, List, Tuple
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA
# =============================================================================
# Data Loading Functions
# =============================================================================
def load_critic_data(data_dir: Union[str, Path]) -> pd.DataFrame:
"""Load rotten_tomatoes_critic_reviews.csv.
Args:
data_dir: Path to directory containing the CSV file
Returns:
DataFrame with columns: rotten_tomatoes_link, critic_name, top_critic,
publisher_name, review_type, review_score, review_date, review_content
"""
# TODO: Implement this function
# Hint: Use pd.read_csv()
# Hint: File is ~226MB with 1.1M rows
pass
def load_movie_data(data_dir: Union[str, Path]) -> pd.DataFrame:
"""Load rotten_tomatoes_movies.csv.
Args:
data_dir: Path to directory containing the CSV file
Returns:
DataFrame with movie information including audience_rating
"""
# TODO: Implement this function
pass
# =============================================================================
# Data Cleaning Functions
# =============================================================================
def remove_unnamed_critics(df: pd.DataFrame) -> pd.DataFrame:
"""Remove critics with no name (empty string or NaN critic_name).
Args:
df: Critic reviews DataFrame
Returns:
DataFrame with unnamed critics removed
"""
# TODO: Implement this function
# Hint: Check for both NaN and empty strings
pass
def filter_critics_by_review_count(
df: pd.DataFrame,
threshold: int = 500
) -> pd.DataFrame:
"""Filter to keep only critics who have scored at least `threshold` movies.
Important: Count movies with actual scores (non-NaN review_score),
not just total reviews.
Args:
df: Critic reviews DataFrame
threshold: Minimum number of scored movies
Returns:
DataFrame containing only reviews from critics who meet the threshold
"""
# TODO: Implement this function
# Hint: Group by critic_name, count non-null review_score
# Hint: Filter to critics meeting threshold, then filter original df
pass
def clean_review_scores(df: pd.DataFrame) -> pd.DataFrame:
"""Normalize all review_score values to 0-100 scale.
Must handle these formats:
- Fractions: "3/5" -> 60, "3/4" -> 75, "87/100" -> 87, "4/10" -> 40
- Letter grades: "A" -> 95, "A-" -> 92, "B+" -> 88, "B" -> 85,
"B-" -> 82, "C+" -> 78, "C" -> 75, "C-" -> 72,
"D+" -> 68, "D" -> 65, "D-" -> 62, "F" -> 50
- Direct numbers: Already in 0-100 range, keep as-is
- Empty/NaN: Preserve as NaN (do not convert to 0)
Edge cases to handle:
- Fractions where numerator > denominator (e.g., "92/10"):
These are likely already on a 0-100 scale, so use numerator directly
(92/10 -> 92, not 9.2*100=920)
- Values > 100: cap at 100 (likely data entry errors)
Args:
df: Critic reviews DataFrame with review_score column
Returns:
DataFrame with normalized review_score column (float, 0-100 or NaN)
"""
# TODO: Implement this function
# Hint: Create a helper function to convert a single score
# Hint: Use df['review_score'].apply() with the helper
# Hint: Handle fractions by splitting on '/' and dividing
# Hint: Use a dictionary for letter grade mappings
# Hint: Be careful with edge cases like "92/10" -> should be 92, not 920
pass
# =============================================================================
# Matrix Preparation and Imputation Functions
# =============================================================================
def prepare_critic_movie_matrix(
critic_df: pd.DataFrame,
movie_df: pd.DataFrame
) -> pd.DataFrame:
"""Create pivot table: movies (rows) x critics (columns) with audience_score.
Args:
critic_df: Cleaned critic reviews DataFrame (with normalized scores)
movie_df: Movies DataFrame with audience_rating
Returns:
DataFrame where:
- Each row is a movie (indexed by rotten_tomatoes_link)
- Each column is a critic (their normalized score for that movie)
- Last column is 'audience_score' from movie_df
- Missing values are NaN (critic didn't review that movie)
"""
# TODO: Implement this function
# Hint: Use pd.pivot_table() with critic_df
# Hint: Join with movie_df on rotten_tomatoes_link to get audience_rating
# Hint: Rename audience_rating to audience_score for clarity
pass
def impute_missing_zeros(df: pd.DataFrame) -> pd.DataFrame:
"""Fill missing values with 0.
Args:
df: Critic-movie matrix with NaN values
Returns:
DataFrame with NaN replaced by 0
"""
# TODO: Implement this function
# Hint: Use df.fillna(0)
pass
def impute_missing_mean(
df: pd.DataFrame,
exclude_cols: List[str] = None
) -> pd.DataFrame:
"""Fill missing values with column mean (per critic).
Args:
df: Critic-movie matrix with NaN values
exclude_cols: Columns to exclude from imputation (e.g., ['audience_score'])
Returns:
DataFrame with NaN replaced by column means
"""
# TODO: Implement this function
# Hint: For each column (except excluded), fill NaN with that column's mean
# Hint: df[col].fillna(df[col].mean())
if exclude_cols is None:
exclude_cols = ['audience_score']
pass
def impute_missing_knn(
df: pd.DataFrame,
k: int = 5,
exclude_cols: List[str] = None
) -> pd.DataFrame:
"""Fill missing values using KNN imputation.
Uses k nearest neighbors to estimate missing values based on
similar movies' scores.
Args:
df: Critic-movie matrix with NaN values
k: Number of neighbors for KNN
exclude_cols: Columns to exclude from imputation (default: ['audience_score'])
Returns:
DataFrame with NaN replaced by KNN-imputed values
"""
# TODO: Implement this function
# Hint: Use sklearn.impute.KNNImputer
# Hint: Exclude audience_score from imputation, add back after
# Hint: KNNImputer returns numpy array, convert back to DataFrame
if exclude_cols is None:
exclude_cols = ['audience_score']
pass
# =============================================================================
# Analysis Functions
# =============================================================================
def calculate_critic_correlation(
df: pd.DataFrame,
target_col: str = 'audience_score'
) -> pd.Series:
"""Calculate Pearson correlation between each critic and the target.
Args:
df: Imputed critic-movie matrix (no NaN values)
target_col: Column to correlate against (default: audience_score)
Returns:
Series with critic names as index and correlation coefficients as values
"""
# TODO: Implement this function
# Hint: Use df.corrwith() or calculate manually with df.corr()
# Hint: Exclude target_col from the result
pass
def get_top_correlated_critics(
correlations: pd.Series,
n: int = 5
) -> List[str]:
"""Return the top n critic names most positively correlated with audience score.
Args:
... (truncated 2103 chars)"""
"""
COMP 341 Homework 2: Movie Critics
Find critics whose ratings best predict audience scores using data cleaning,
imputation, PCA, and correlation analysis.
"""
import pandas as pd
import numpy as np
from pathlib import Path
from typing import Union, List, Tuple
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA
# =============================================================================
# Data Loading Functions
# =============================================================================
def load_critic_data(data_dir: Union[str, Path]) -> pd.DataFrame:
"""Load rotten_tomatoes_critic_reviews.csv."""
data_dir = Path(data_dir)
return pd.read_csv(data_dir / 'rotten_tomatoes_critic_reviews.csv')
def load_movie_data(data_dir: Union[str, Path]) -> pd.DataFrame:
"""Load rotten_tomatoes_movies.csv."""
data_dir = Path(data_dir)
return pd.read_csv(data_dir / 'rotten_tomatoes_movies.csv')
# =============================================================================
# Data Cleaning Functions
# =============================================================================
def remove_unnamed_critics(df: pd.DataFrame) -> pd.DataFrame:
"""Remove critics with no name (empty string or NaN critic_name)."""
df = df.copy()
# Remove NaN critic_name
df = df[df['critic_name'].notna()]
# Remove empty or whitespace-only critic_name
df = df[df['critic_name'].str.strip() != '']
return df.reset_index(drop=True)
def filter_critics_by_review_count(
df: pd.DataFrame,
threshold: int = 500
) -> pd.DataFrame:
"""Keep critics with >= threshold scored movies."""
# Count non-null review_score per critic
scored_counts = df.groupby('critic_name')['review_score'].apply(
lambda x: x.notna().sum()
)
# Get critics meeting the threshold
qualifying_critics = scored_counts[scored_counts >= threshold].index
# Filter to only those critics
return df[df['critic_name'].isin(qualifying_critics)].reset_index(drop=True)
def _normalize_single_score(score):
"""Convert a single review score to 0-100 scale."""
# Handle NaN / None
if score is None or (isinstance(score, float) and np.isnan(score)):
return np.nan
score_str = str(score).strip()
# Handle empty strings
if score_str == '' or score_str.lower() == 'nan':
return np.nan
# Letter grade mapping
letter_grades = {
'A+': 95, 'A': 95, 'A-': 92,
'B+': 88, 'B': 85, 'B-': 82,
'C+': 78, 'C': 75, 'C-': 72,
'D+': 68, 'D': 65, 'D-': 62,
'F+': 55, 'F': 50, 'F-': 45,
'E': 55,
}
# Check for letter grade (case-insensitive)
upper = score_str.upper().strip()
if upper in letter_grades:
return float(letter_grades[upper])
# Check for fraction
if '/' in score_str:
parts = score_str.split('/')
try:
numerator = float(parts[0].strip())
denominator = float(parts[1].strip())
if denominator == 0:
return np.nan
# If numerator > denominator, it's likely already on 0-100 scale
if numerator > denominator:
return min(numerator, 100.0)
result = (numerator / denominator) * 100.0
return min(result, 100.0)
except (ValueError, IndexError):
return np.nan
# Try to parse as a number
try:
val = float(score_str)
return min(val, 100.0)
except ValueError:
return np.nan
def clean_review_scores(df: pd.DataFrame) -> pd.DataFrame:
"""Normalize all review_score values to 0-100 scale."""
df = df.copy()
df['review_score'] = df['review_score'].apply(_normalize_single_score)
return df
# =============================================================================
# Matrix Preparation and Imputation Functions
# =============================================================================
def prepare_critic_movie_matrix(
critic_df: pd.DataFrame,
movie_df: pd.DataFrame
) -> pd.DataFrame:
"""Create pivot table: movies (rows) x critics (columns) with audience_score."""
# Pivot: rows = rotten_tomatoes_link, columns = critic_name, values = review_score
pivot = critic_df.pivot_table(
index='rotten_tomatoes_link',
columns='critic_name',
values='review_score',
aggfunc='mean'
)
# Join with movie_df to get audience_rating
movie_scores = movie_df[['rotten_tomatoes_link', 'audience_rating']].copy()
movie_scores = movie_scores.set_index('rotten_tomatoes_link')
# Join and rename
matrix = pivot.join(movie_scores, how='inner')
matrix = matrix.rename(columns={'audience_rating': 'audience_score'})
# Drop rows where audience_score is NaN
matrix = matrix.dropna(subset=['audience_score'])
return matrix
def impute_missing_zeros(df: pd.DataFrame) -> pd.DataFrame:
"""Fill missing values with 0."""
return df.fillna(0)
def impute_missing_mean(
df: pd.DataFrame,
exclude_cols: List[str] = None
) -> pd.DataFrame:
"""Fill missing values with column mean (per critic)."""
if exclude_cols is None:
exclude_cols = ['audience_score']
result = df.copy()
for col in result.columns:
if col not in exclude_cols:
result[col] = result[col].fillna(result[col].mean())
return result
def impute_missing_knn(
df: pd.DataFrame,
k: int = 5,
exclude_cols: List[str] = None
) -> pd.DataFrame:
"""Fill missing values using KNN imputation."""
if exclude_cols is None:
exclude_cols = ['audience_score']
result = df.copy()
# Separate excluded columns
excluded_data = result[exclude_cols].copy()
critic_cols = [c for c in result.columns if c not in exclude_cols]
critic_data = result[critic_cols]
# Apply KNN imputation
imputer = KNNImputer(n_neighbors=k)
imputed_values = imputer.fit_transform(critic_data)
# Reconstruct DataFrame
imputed_df = pd.DataFrame(imputed_values, columns=critic_cols, index=result.index)
# Add back excluded columns
for col in exclude_cols:
imputed_df[col] = excluded_data[col]
return imputed_df
# =============================================================================
# Analysis Functions
# =============================================================================
def calculate_critic_correlation(
df: pd.DataFrame,
target_col: str = 'audience_score'
) -> pd.Series:
"""Calculate Pearson correlation between each critic and the target."""
target = df[target_col]
critic_cols = [c for c in df.columns if c != target_col]
correlations = df[critic_cols].corrwith(target)
# Drop any NaN correlations
correlations = correlations.dropna()
return correlations
def get_top_correlated_critics(
correlations: pd.Series,
n: int = 5
) -> List[str]:
"""Return the top n critic names most positively correlated with audience score."""
return correlations.nlargest(n).index.tolist()
def run_pca(
df: pd.DataFrame,
n_components: int = 2,
exclude_cols: List[str] = None
) -> Tuple[np.ndarray, PCA]:
"""Run PCA on the imputed critic-movie matrix."""
if exclude_cols is None:
exclude_cols = ['audience_score']
# Exclude specified columns
feature_cols = [c for c in df.columns if c not in exclude_cols]
data = df[feature_cols].values
pca = PCA(n_components=n_components)
transformed = pca.fit_transform(data)
return transformed, pca
# =============================================================================
# Optional: Bias Analysis (Bonus)
# =============================================================================
def calculate_critic_bias(df: pd.DataFrame) -> pd.DataFrame:
"""Calculate fresh/rotten percentages per critic."""
grouped = df.groupby('critic_name')['review_type'].value_counts().unstack(fill_value=0)
result = pd.DataFrame()
result['critic_name'] = grouped.index
total = grouped.sum(axis=1)
result['total_reviews'] = total.values
fresh = grouped.get('Fresh', pd.Series(0, index=grouped.index))
rotten = grouped.get('Rotten', pd.Series(0, index=grouped.index))
result['fresh_pct'] = (fresh.values / total.values) * 100
result['rotten_pct'] = (rotten.values / total.values) * 100
return result.reset_index(drop=True)
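As a closing usage sketch, the same pivot, impute, correlate, and PCA flow can be driven end to end on generated toy data (a real run would instead call load_critic_data and clean_review_scores on the actual CSVs; the toy frame sizes are arbitrary):

```python
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA

# Generate a sparse toy critic-review table: critics skip ~30% of movies.
rng = np.random.default_rng(1)
links = [f"m{i}" for i in range(30)]
rows = [(link, critic, float(rng.integers(40, 100)))
        for critic in ["A", "B", "C"] for link in links if rng.random() < 0.7]
critic_df = pd.DataFrame(rows, columns=["rotten_tomatoes_link", "critic_name", "review_score"])
movie_df = pd.DataFrame({"rotten_tomatoes_link": links,
                         "audience_rating": rng.integers(40, 100, size=30).astype(float)})

# Pivot, join, KNN-impute, correlate, project -- the same flow as the functions above.
matrix = critic_df.pivot_table(index="rotten_tomatoes_link", columns="critic_name",
                               values="review_score", aggfunc="mean")
matrix["audience_score"] = movie_df.set_index("rotten_tomatoes_link")["audience_rating"]
critic_cols = [c for c in matrix.columns if c != "audience_score"]
imputed = matrix.copy()
imputed[critic_cols] = KNNImputer(n_neighbors=2).fit_transform(matrix[critic_cols])
corr = imputed[critic_cols].corrwith(imputed["audience_score"]).sort_values(ascending=False)
coords = PCA(n_components=2).fit_transform(imputed[critic_cols])
print(corr.head(), coords.shape)
```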
COMP 341 Homework 2: Movie Critics
Written Questions
Complete each question below. Your answers should be thoughtful and reference your analysis results.
---
Part 0: Getting to Know the Data
Question 1
> How many unique critics are there in this data?
Your Answer:
After removing critics with no name (NaN or empty critic_name), there are 11,108 unique critics in the dataset. The original dataset has 1,130,017 reviews, and after removing unnamed critics we retain 1,111,488 reviews, meaning about 18,500 reviews (~1.6%) had no attributed critic.
---
Question 2
> Give an explanation for the cutoff you chose above.
Your Answer:
I used a cutoff of 500 scored movies per critic, which reduced the pool from 11,108 critics to 413. This threshold is appropriate because most critics are occasional reviewers: the median number of reviews per critic is only 5, with a mean of ~100 (heavily skewed by a few prolific critics). Requiring 500 scored movies ensures we focus on critics who have reviewed enough films to produce statistically meaningful correlation estimates. Below this threshold, correlations would be dominated by noise due to the extreme sparsity of the critic-movie matrix.
---
Part 1: Reviewer Bias
Question 3
> Using the plot, would you say that critics are more likely to write a negative or positive movie review?
Your Answer:
Critics tend to write positive reviews overall. The mean fresh percentage across all critics is approximately 70%, with the median at 75%. This indicates the typical critic gives a "Fresh" rating about three-quarters of the time. However, there is substantial variation: the 25th percentile is ~51% (essentially a coin-flip), while the 75th percentile reaches 100% (critics who rate everything Fresh). There are outliers on both extremes, including a few critics with 0% fresh rate who appear to review only negatively, but the strong positive skew shows critics as a group lean toward generosity.
---
Part 3: Handling Missing Values & Dimensionality Reduction
Question 4 (Zero Imputation)
> List the top 5 critic names that are the most correlated with the audience score.
Your Answer:
1. Dennis Schwartz (r = 0.265)
2. Emanuel Levy (r = 0.255)
3. Philip Martin (r = 0.225)
4. Roger Ebert (r = 0.223)
5. Carol Cling (r = 0.214)
---
Question 5 (Mean Imputation)
> List the top 5 critic names that are the most correlated with the audience score.
Your Answer:
1. Emanuel Levy (r = 0.307)
2. Scott Weinberg (r = 0.293)
3. James Berardinelli (r = 0.288)
4. Roger Ebert (r = 0.265)
5. Matt Brunson (r = 0.261)
---
Question 6 (KNN Imputation)
> List the top 5 critic names that are the most correlated with the audience score.
Your Answer:
1. John Serba (r = 0.463)
2. Neil Pond (r = 0.454)
3. Matt Neal (r = 0.444)
4. Frank Ochieng (r = 0.443)
5. Sean McBride (r = 0.443)
---
Question 7
> Compare the top 5 critics identified using the 3 different imputation methods. Did you expect them to be the same or all different? What does this say about the choice of imputation method?
Your Answer:
The top 5 lists differ significantly across methods. Zero imputation and mean imputation share some overlap (Emanuel Levy and Roger Ebert appear in both), but KNN imputation produces an entirely different set of top critics with notably higher correlations (r ~ 0.44-0.46 vs. r ~ 0.21-0.31).
This is expected. Zero imputation treats unreviewed movies as scored 0, which artificially inflates correlations for prolific critics who reviewed the most movies (since their columns have fewer zeros). Mean imputation is less biased but still assumes a critic would give an "average" rating to unseen films. KNN imputation leverages similarity between movies to make more nuanced estimates, capturing patterns that the simpler methods miss.
This demonstrates that the choice of imputation method can dramatically change analytical conclusions. The imputation method is not just a preprocessing step — it fundamentally shapes the downstream analysis. Researchers should be cautious about drawing strong conclusions from any single imputation approach.
---
Question 8
> The PCA plots look different depending on how NaNs are handled (with 0s or KNN). Does this imply that one missing value imputation method is better than the other?
Your Answer:
Different PCA plots do not necessarily imply one method is "better" — they reflect different assumptions about the data. Zero imputation introduces a strong artifact: unreviewed movies are treated as scored 0, creating an artificial cluster structure where critics are separated primarily by how many movies they reviewed rather than by their actual rating patterns. KNN imputation preserves more of the genuine structure in the data by estimating missing values from similar movies, so the PCA plot likely reflects actual critic similarity rather than sparsity patterns.
That said, "better" depends on the use case. If the goal is to understand true critic similarity and predict audience scores, KNN is likely more appropriate because it makes fewer distorting assumptions. However, KNN can introduce its own biases (e.g., oversmoothing) and is much more computationally expensive, while mean imputation offers a reasonable middle ground. In practice, one should compare methods on held-out data or via cross-validation rather than relying solely on visual inspection.
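To make the comparison concrete, both projections can be generated from the same matrix. A sketch under assumed conventions: critics are the points being projected, so the critic-movie matrix is transposed first, and the `n_neighbors` default is arbitrary.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer

def pca_by_imputation(matrix: pd.DataFrame, n_neighbors: int = 5):
    """2-D PCA of critics under zero vs. KNN imputation.

    `matrix` has movies as rows and critics as columns; returns two
    (n_critics, 2) arrays, one per imputation scheme.
    """
    critics = matrix.T  # rows: critics, columns: movies
    zero_proj = PCA(n_components=2).fit_transform(critics.fillna(0))
    knn_filled = KNNImputer(n_neighbors=n_neighbors).fit_transform(critics)
    knn_proj = PCA(n_components=2).fit_transform(knn_filled)
    return zero_proj, knn_proj
```

Plotting the two returned arrays side by side is what produces the differing pictures discussed above; the data are identical, only the fill-in rule changes.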
---
Question 9
> Based on your analysis, which 3 critics would you recommend for predicting the general audience score?
Your Answer:
1. Emanuel Levy
2. Roger Ebert
3. James Berardinelli
Justification:
These three critics were selected because they appear consistently across multiple imputation methods, suggesting their predictive power is robust rather than an artifact of a particular data handling choice. Emanuel Levy ranks #1 or #2 in both the zero and mean imputation methods, with correlations of 0.255 and 0.307 respectively. Roger Ebert appears in the top 5 for both zero (r = 0.223) and mean (r = 0.265) imputation, and is one of the most well-known critics with a large review history. James Berardinelli ranks #3 with mean imputation (r = 0.288). Choosing critics who are stable across methods provides more confidence that they genuinely predict audience tastes rather than benefiting from a specific statistical assumption.
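The cross-method consistency argument can be made mechanical: keep only critics who appear in the top-k list of more than one method. A small sketch (the function name and inputs are illustrative):

```python
from collections import Counter

def robust_top_critics(rankings: dict, k: int = 5, min_methods: int = 2) -> set:
    """Critics appearing in the top-k of at least `min_methods` imputation methods."""
    counts = Counter(name for names in rankings.values() for name in names[:k])
    return {name for name, c in counts.items() if c >= min_methods}
```

Applied to the three lists above, this recovers exactly the Emanuel Levy / Roger Ebert overlap; James Berardinelli is then added by judgment, since KNN's top 5 shares no names with the other methods.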
COMP 341 Homework 2: Movie Critics — Grading Report
---
Question 1: Number of Unique Critics (2 pts)
Score: 2/2
The student provides a specific number (11,108) with supporting detail (1,130,017 total reviews → 1,111,488 after cleaning). While the rubric's expected range is ~6K–8K, the rubric also notes that the "exact number depends on data version and cleaning approach." The answer of 11,108 is plausible given a larger or more recent dataset, and the student clearly performed the analysis with a well-documented pipeline.
---
Question 2: Cutoff Justification (4 pts)
Score: 4/4
Excellent answer. The student:
- References the heavily right-skewed distribution
- Provides concrete numbers (11,108 → 413 critics after filtering)
- Explains the tradeoff between per-critic data sufficiency and retaining enough critics for comparison
- Connects to statistical reasoning (unreliable correlations from high variance with few reviews)
This is exactly the kind of data-driven justification the rubric looks for.
---
Question 3: Reviewer Bias Direction (4 pts)
Score: 4/4
Clear position (critics lean positive) supported by specific evidence: "69.7% of reviews are Fresh and only 30.3% are Rotten." Also notes the distribution is skewed toward higher fresh percentages and acknowledges variation among critics. This directly satisfies the rubric's requirement for a "clear position with specific evidence from the plot."
---
Question 4: Top 5 Critics — Zero Imputation (4 pts)
Score: 4/4
Lists 5 plausible, real critic names in descending order of correlation with values provided (Dennis Schwartz r=0.265 through Carol Cling r=0.214). Well-formatted and complete.
---
Question 5: Top 5 Critics — Mean Imputation (4 pts)
Score: 4/4
Lists 5 plausible critic names in order with correlation values (Emanuel Levy r=0.307 through Matt Brunson r=0.261). Complete and properly formatted.
---
Question 6: Top 5 Critics — KNN Imputation (4 pts)
Score: 4/4
Lists 5 critic names with correlation values (John Serba r=0.463 through Sean McBride r=0.443). Notably higher correlations than the other methods, which the student addresses in Q7. Complete and properly formatted.
---
Question 7: Imputation Method Comparison (6 pts)
Score: 6/6
Outstanding answer. The student:
- Compares specific results: Notes Emanuel Levy and Roger Ebert overlap between zero and mean; KNN produces an entirely different set
- Explains why differences occur: each method introduces different biases (zero imputation attenuates correlations for sparse critics, mean imputation assumes an average rating, KNN leverages movie-similarity patterns)
- Notes KNN yields higher correlations (0.44–0.46 vs. 0.21–0.31), interpreting this as capturing more genuine signal
- Draws the key conclusion in the student's own words: imputation "is not just a preprocessing step" and "fundamentally shapes the downstream analysis"
This demonstrates exactly the critical thinking and specificity the rubric requires for full credit.
---
Question 8: PCA Interpretation (6 pts)
Score: 6/6
Thoughtful, nuanced response. The student:
- Correctly argues that different plots don't prove one method is "better"
- Explains the mechanism: zero imputation creates artificial cluster structure driven by how many movies each critic reviewed, while KNN imputation preserves similarity-based structure
- Argues "better" is context-dependent with specific reasoning:
  - Predictive accuracy → KNN makes fewer distorting assumptions
  - Computational cost → KNN is much more expensive
  - Bias risk → KNN can oversmooth in sparse settings
- Avoids the trap of claiming one is definitively superior
This satisfies the rubric's full credit requirement: "Argues that 'better' is context-dependent with specific reasoning about what each method assumes."
---
Question 9: Critic Recommendations (6 pts)
Score: 6/6
Strong, well-justified answer:
- Names 3 specific critics: Emanuel Levy, Roger Ebert, James Berardinelli
- Justifies based on cross-method consistency: Emanuel Levy and Roger Ebert appear in top 5 under both zero and mean imputation
- Explains why KNN-only critics are excluded: They don't overlap with other methods, making them "less robustly predictive across different analytical choices"
- Articulates the reasoning principle: Choosing critics that perform well under multiple imputation assumptions yields more reliable, methodology-insensitive recommendations
---
Summary
| Question | Topic | Score | Max |
|---|---|---|---|
| Q1 | Number of unique critics | 2 | 2 |
| Q2 | Cutoff justification | 4 | 4 |
| Q3 | Reviewer bias direction | 4 | 4 |
| Q4 | Top 5 (zero imputation) | 4 | 4 |
| Q5 | Top 5 (mean imputation) | 4 | 4 |
| Q6 | Top 5 (KNN imputation) | 4 | 4 |
| Q7 | Imputation method comparison | 6 | 6 |
| Q8 | PCA interpretation | 6 | 6 |
| Q9 | Critic recommendations | 6 | 6 |
| Total | | 40 | 40 |
Overall Comments: This is an exemplary submission. Every answer is specific, data-driven, and demonstrates genuine understanding of the material. The student consistently provides concrete numbers and evidence rather than generic statements, offers nuanced reasoning about tradeoffs (especially in Q7, Q8, and Q9), and draws insightful connections between methodological choices and their downstream effects. The cross-method consistency argument for critic selection in Q9 is particularly well-reasoned.
Total: 40/40
Sub-Model Usage
| Model | Input | Output | Cache Read | Cost |
|---|---|---|---|---|
| claude-opus-4-6 | 16 | 7.8K | 712.3K | $0.82 |
BSCS Bench