CS 328 Writing Assignment

Team Members:

Arpan Gupta (24110051)
Buha Deep (24110082)
Rathod Yuvraj (24110293)
Solanki Viraj (24110348)

1. Introduction

Elections are influenced by many interacting factors, including candidate wealth, criminal background, age, gender, education, party affiliation, and constituency-level competition. This report investigates these factors using candidate-level data from the Indian Lok Sabha elections (2019) in LS_2.0.csv. The dataset includes vote outcomes, demographics, party and category labels, criminal case counts, and financial declarations (assets and liabilities), making it suitable for evidence-based hypothesis testing.

A key preparation step in this analysis is data cleaning and standardization. Several column names contain embedded newline characters, and financial values are stored in mixed textual formats (for example, Lacs/Crore strings). We normalize column names, convert key fields to numeric form, and derive additional variables such as liability-to-asset ratio, margin of victory, vote share, and deposit-loss indicators. These transformations allow meaningful comparisons across candidates and states.

Following the assignment objectives, the investigation is organized into four parts: (1) summarize the dataset using descriptive statistics and visual aids, (2) posit a focused set of testable hypotheses, (3) define measurable quantities for each hypothesis (for example, win probability differences, medians, correlations, and group-wise rates), and (4) perform analysis to settle each hypothesis with data-backed justification.

The hypotheses examined in this report focus on practical electoral questions, such as whether higher wealth is associated with higher win probability, whether criminal background correlates with candidate wealth or party patterns, whether age and gender relate to electoral outcomes, and whether party effects dominate individual candidate attributes in specific states. All analyses and visualizations are carried out in Python using pandas, NumPy, SciPy, and Plotly.

Code

import re
import pandas as pd
import numpy as np
import plotly.io as pio
import plotly.express as px
from scipy.stats import mannwhitneyu

pio.renderers.default = "plotly_mimetype+notebook_connected"
pio.templates.default = "plotly_white"

2. Data Loading and Cleaning

The analysis starts by loading the Lok Sabha candidate-level dataset from ./data/LS_2.0.csv into a pandas DataFrame (df). We also create a winners-only subset (winner_df) using the WINNER flag, since many hypotheses compare overall candidate patterns against winning candidates specifically.

2.1 Schema Normalization and Initial Inspection

A major challenge in this dataset is inconsistent schema formatting. Several column names contain embedded newline characters and long labels, which makes downstream analysis error-prone. To make the data analysis-ready, we rename these fields into clean, machine-friendly column names. Examples include:

- CRIMINAL\nCASES -> CRIMINAL_CASES
- TOTAL\nVOTES -> TOTAL_VOTES
- TOTAL ELECTORS -> TOTAL_ELECTORS
- OVER TOTAL VOTES POLLED \nIN CONSTITUENCY -> OVER_TOTAL_VOTES_POLLED_IN_CONSTITUENCY

After renaming, we inspect data types and null patterns using df.info() and summary statistics to identify fields that require numeric conversion or additional cleaning.
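
The explicit rename map used in this section is easy to audit. As an alternative sketch, the same normalization can be written as a rule that collapses any run of whitespace (including embedded newlines) into a single underscore. The helper name normalize_column_name is hypothetical and is not part of the original notebook.

```python
import re


def normalize_column_name(name: str) -> str:
    """Collapse whitespace runs (spaces, newlines) into single underscores."""
    return re.sub(r"\s+", "_", name.strip())


# Raw labels taken from the examples listed in this section.
raw_columns = [
    "CRIMINAL\nCASES",
    "TOTAL\nVOTES",
    "TOTAL ELECTORS",
    "OVER TOTAL VOTES POLLED \nIN CONSTITUENCY",
]
normalized = [normalize_column_name(c) for c in raw_columns]
print(normalized)
```

Applied to a DataFrame, this would be df.columns = [normalize_column_name(c) for c in df.columns]. Note that the rule rewrites every column containing whitespace, which is broader than the hand-written map, so the resulting names should be checked before relying on them.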

Code

df = pd.read_csv('./data/LS_2.0.csv')
winner_df = df[df['WINNER'] == 1].copy()

print(f"Total candidates: {len(df):,}")
print(f"Winning candidates: {len(winner_df):,}")

rename_map = {
    'CRIMINAL\nCASES': 'CRIMINAL_CASES',
    'GENERAL\nVOTES': 'GENERAL_VOTES',
    'POSTAL\nVOTES': 'POSTAL_VOTES',
    'TOTAL\nVOTES': 'TOTAL_VOTES',
    'OVER TOTAL ELECTORS \nIN CONSTITUENCY': 'OVER_TOTAL_ELECTORS_IN_CONSTITUENCY',
    'OVER TOTAL VOTES POLLED \nIN CONSTITUENCY': 'OVER_TOTAL_VOTES_POLLED_IN_CONSTITUENCY',
    'TOTAL ELECTORS': 'TOTAL_ELECTORS',
}
df = df.rename(columns=rename_map)
winner_df = winner_df.rename(columns=rename_map)

Total candidates: 2,263
Winning candidates: 539

2.2 Data Cleaning, Transformation, and Feature Engineering

Beyond column normalization, the core preprocessing effort addresses mixed-format numeric fields and creates derived variables needed to quantify hypotheses.

(a) Numeric Cleaning

CRIMINAL_CASES contains non-numeric values in some rows. We convert it to numeric using pd.to_numeric(..., errors='coerce'), replace invalid values with 0, and cast to integer. This ensures consistent aggregation for crime-related comparisons across parties, states, and winner/loser groups.
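
A small sketch of this conversion on made-up values shows what the coercion does. The sample strings below are hypothetical, not taken from the dataset. Counting the coerced values before filling them is a cheap way to see how many rows end up as "0 cases" only because the original value was missing or unparseable.

```python
import pandas as pd

# Hypothetical raw values: numeric strings, a text placeholder, and a missing entry.
raw_cases = pd.Series(["3", "0", "Not Available", None, "12"])

# Non-numeric entries become NaN under errors="coerce".
coerced = pd.to_numeric(raw_cases, errors="coerce")

# Number of rows that will be imputed with 0 in the next step.
n_imputed = int(coerced.isna().sum())

clean_cases = coerced.fillna(0).astype(int)
print(clean_cases.tolist(), n_imputed)
```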

(b) Financial Standardization (Assets and Liabilities)

ASSETS and LIABILITIES are originally stored as semi-structured text (for example, strings with symbols and units such as Lacs/Crore). To standardize monetary values, we define a parsing function that:

- extracts the numeric amount and unit using regex,
- converts Lacs to Crore units (1 Crore = 100 Lacs),
- returns NaN where parsing is not reliable.

Both df and winner_df are transformed to numeric values in Crore units, enabling fair and comparable financial analysis.
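
The parsing rule can be illustrated on a few strings. This is a minimal sketch that reuses the regex pattern from the notebook's parsing function; the sample strings (including the "Thou" unit and "Nil") are hypothetical inputs chosen to show which shapes parse and which fall back to NaN, and parse_amount_in_crore is an illustrative name.

```python
import math
import re

# Same pattern as the notebook's parser: '~', an amount, then a Crore/Lac(s) unit.
AMOUNT_PATTERN = re.compile(r"~\s*([\d.]+)\s*(Crore|Lacs?)\+?", re.IGNORECASE)


def parse_amount_in_crore(text: str) -> float:
    """Return the approximate declared amount in Crore, or NaN if unparseable."""
    match = AMOUNT_PATTERN.search(text)
    if match is None:
        return math.nan
    number = float(match.group(1))
    unit = match.group(2).lower()
    # 100 Lacs = 1 Crore
    return number / 100.0 if unit.startswith("lac") else number


# Hypothetical sample strings, not rows from the dataset.
samples = [
    "Rs 5,01,00,000\n ~ 5 Crore+",
    "Rs 30,99,414\n ~ 30 Lacs+",
    "Rs 50,000\n ~ 50 Thou+",
    "Nil",
]
parsed = [parse_amount_in_crore(s) for s in samples]
print(parsed)
```

Under this pattern, any amount expressed in a unit other than Crore or Lacs is returned as NaN rather than as a small number, so such rows drop out of later asset-based comparisons.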

(c) Derived Analytical Features

To test hypotheses rigorously, we construct additional variables from the cleaned data:

- liability_asset_ratio for financial risk comparisons between winners and non-winners.
- RUNNER_UP_VOTES, MARGIN_OF_VICTORY, and MARGIN_PERCENT for constituency-level competitiveness and wealth-margin analysis.
- HAS_CRIMINAL_RECORD as a binary indicator for crime-vs-wealth profiling.
- IS_INDEPENDENT, VOTE_SHARE, and LOST_DEPOSIT to evaluate independent-candidate performance and deposit forfeiture patterns.
- Group-based aggregates (for gender, education, category, party, and state) to support descriptive summaries and hypothesis settlement.
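
The constituency-level features can be traced on a toy example. The sketch below uses two invented constituencies with made-up vote counts; the column names follow the notebook, while the data and the names toy, ranked, and winners_toy are hypothetical.

```python
import pandas as pd

# Invented vote counts for two constituencies in a fictional state "X".
toy = pd.DataFrame({
    "STATE": ["X"] * 5,
    "CONSTITUENCY": ["A", "A", "A", "B", "B"],
    "NAME": ["a1", "a2", "a3", "b1", "b2"],
    "TOTAL_VOTES": [600, 300, 100, 550, 450],
    "WINNER": [1, 0, 0, 1, 0],
})

keys = ["STATE", "CONSTITUENCY"]

# Vote share within each constituency, then the one-sixth deposit rule.
toy["VOTE_SHARE"] = toy["TOTAL_VOTES"] / toy.groupby(keys)["TOTAL_VOTES"].transform("sum")
toy["LOST_DEPOSIT"] = (toy["VOTE_SHARE"] < 1 / 6).astype(int)

# Sort by votes (descending) so the next row in each group is the runner-up.
ranked = toy.sort_values(keys + ["TOTAL_VOTES"], ascending=[True, True, False]).copy()
ranked["RUNNER_UP_VOTES"] = ranked.groupby(keys)["TOTAL_VOTES"].shift(-1)

winners_toy = ranked[ranked["WINNER"] == 1].copy()
winners_toy["MARGIN_OF_VICTORY"] = winners_toy["TOTAL_VOTES"] - winners_toy["RUNNER_UP_VOTES"]
print(toy[["NAME", "VOTE_SHARE", "LOST_DEPOSIT"]])
print(winners_toy[["NAME", "MARGIN_OF_VICTORY"]])
```

Only candidate a3 (10% of the votes in constituency A) falls below the one-sixth threshold, and the winners' margins are the gaps to the second-ranked row in each group.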

(d) Data Readiness and Limitations

After preprocessing, the dataset is suitable for both descriptive exploration and hypothesis testing through rates, medians, correlations, and group comparisons. Some values remain missing or unparseable (especially in financial declarations), so specific calculations use valid non-null observations for the relevant variables. This keeps each test transparent while preserving as much information as possible from the original election records.

Code

# (a) Numeric cleaning for CRIMINAL_CASES
df['CRIMINAL_CASES'] = pd.to_numeric(df['CRIMINAL_CASES'], errors='coerce').fillna(0).astype(int)
winner_df['CRIMINAL_CASES'] = pd.to_numeric(winner_df['CRIMINAL_CASES'], errors='coerce').fillna(0).astype(int)

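# (b) Financial standardization: parse ASSETS and LIABILITIES into Crore units
def to_crore(value):
    """Convert a declared amount such as '~ 5 Crore+' or '~ 30 Lacs+' to Crore."""
    # Already-numeric values pass through, so re-running the cell is harmless.
    if isinstance(value, (int, float, np.integer, np.floating)):
        return float(value)
    if pd.isna(value):
        return np.nan

    # The approximate amount and its unit follow the '~' marker in the raw text.
    text = str(value)
    match = re.search(r'~\s*([\d.]+)\s*(Crore|Lacs?)\+?', text, re.IGNORECASE)
    if not match:
        return np.nan  # any other unit or layout is treated as unparseable

    number = float(match.group(1))
    unit = match.group(2).lower()

    if 'lac' in unit:
        return number / 100.0  # 100 lakh = 1 crore
    return number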

for col in ['ASSETS', 'LIABILITIES']:
    df[col] = df[col].apply(to_crore)
    winner_df[col] = winner_df[col].apply(to_crore)

# (c) Derived analytical features

# Financial risk proxy
df['liability_asset_ratio'] = df['LIABILITIES'] / (df['ASSETS'] + 1e-6)

# Criminal-record indicator
df['HAS_CRIMINAL_RECORD'] = (df['CRIMINAL_CASES'] > 0).astype(int)
winner_df['HAS_CRIMINAL_RECORD'] = (winner_df['CRIMINAL_CASES'] > 0).astype(int)

# Independent-candidate and deposit-related features
# A deposit is forfeited when a candidate polls less than one-sixth of the valid votes.
DEPOSIT_THRESHOLD = 1 / 6
df['IS_INDEPENDENT'] = df['PARTY'].astype(str).str.strip().str.lower().isin(['independent', 'ind', 'ind.']).astype(int)
# Vote share is relative to the votes of the candidates listed in the dataset for that constituency.
df['VOTE_SHARE'] = df['TOTAL_VOTES'] / df.groupby(['STATE', 'CONSTITUENCY'])['TOTAL_VOTES'].transform('sum')
df['LOST_DEPOSIT'] = (df['VOTE_SHARE'] < DEPOSIT_THRESHOLD).astype(int)

# Constituency ranking for margin-of-victory calculations
df_sorted = df.sort_values(by=['STATE', 'CONSTITUENCY', 'TOTAL_VOTES'], ascending=[True, True, False]).copy()
df_sorted['RUNNER_UP_VOTES'] = df_sorted.groupby(['STATE', 'CONSTITUENCY'])['TOTAL_VOTES'].shift(-1)

winners = df_sorted[df_sorted['WINNER'] == 1].copy()
winners['MARGIN_OF_VICTORY'] = winners['TOTAL_VOTES'] - winners['RUNNER_UP_VOTES']
winners['MARGIN_PERCENT'] = (winners['MARGIN_OF_VICTORY'] / winners['TOTAL_ELECTORS']) * 100

3. Data Summary and Overview

We begin by summarizing the key characteristics of the prepared dataset and visualizing the overall trends.

Code

key_cols = ['STATE', 'CONSTITUENCY', 'NAME', 'PARTY']
if all(col in df.columns for col in key_cols):
    unique_candidates_df = df[key_cols].drop_duplicates()
    total_unique_candidates = len(unique_candidates_df)
else:
    total_unique_candidates = len(df)

total_candidates = len(df)
total_winners = int(df['WINNER'].eq(1).sum())
win_rate = (total_winners / total_candidates) * 100 if total_candidates else 0

total_constituencies = df[['STATE', 'CONSTITUENCY']].drop_duplicates().shape[0]
total_states = df['STATE'].nunique()

overall_min_age = pd.to_numeric(df['AGE'], errors='coerce').min()
overall_max_age = pd.to_numeric(df['AGE'], errors='coerce').max()
overall_median_age = pd.to_numeric(df['AGE'], errors='coerce').median()

def missing_pct(series):
    return series.isna().mean() * 100

summary_stats = {
    'Total Rows (Authorship-like instances)': total_candidates,
    'Total Unique Candidates': total_unique_candidates,
    'Total Winners': total_winners,
    'Win Rate (%)': f"{win_rate:.2f}",
    'Total Constituencies': total_constituencies,
    'Total States/UTs': total_states,
    'Age Range': f"{overall_min_age:.0f} - {overall_max_age:.0f}",
    'Median Age': f"{overall_median_age:.1f}",
    'Missing AGE (%)': f"{missing_pct(df['AGE']):.2f}",
    'Missing GENDER (%)': f"{missing_pct(df['GENDER']):.2f}",
    'Missing EDUCATION (%)': f"{missing_pct(df['EDUCATION']):.2f}",
    'Missing CATEGORY (%)': f"{missing_pct(df['CATEGORY']):.2f}"
}

print('Section 3 Summary Statistics')
display(pd.DataFrame(summary_stats.items(), columns=['Metric', 'Value']).set_index('Metric'))

# -------- Plot 1: Candidates per State (Top 15) --------
print('\nPlot 1: Top 15 States/UTs by Candidate Count')
state_counts = df['STATE'].value_counts().head(15).sort_values(ascending=True).reset_index()
state_counts.columns = ['STATE', 'CandidateCount']
fig1 = px.bar(
    state_counts,
    x='CandidateCount',
    y='STATE',
    orientation='h',
    title='Top 15 States/UTs by Number of Candidates',
    labels={'CandidateCount': 'Number of Candidates', 'STATE': 'State/UT'},
    text='CandidateCount'
)
fig1.update_traces(textposition='outside')
fig1.show()

# -------- Plot 2: Gender Distribution by Outcome --------
print('Plot 2: Gender Distribution by Winner Status')
gender_outcome = pd.crosstab(df['GENDER'].fillna('Unknown'), df['WINNER'])
gender_outcome = gender_outcome.rename(columns={0: 'Non-Winner', 1: 'Winner'}).reset_index()
gender_long = gender_outcome.melt(id_vars='GENDER', var_name='Outcome', value_name='Count')
fig2 = px.bar(
    gender_long,
    x='GENDER',
    y='Count',
    color='Outcome',
    barmode='stack',
    title='Gender Distribution Across Winner Status',
    labels={'GENDER': 'Gender'}
)
fig2.show()

# -------- Plot 3: Criminal Record Rate by Outcome --------
print('Plot 3: Criminal Record Rate by Winner Status')
crime_rate = (df.groupby('WINNER')['HAS_CRIMINAL_RECORD'].mean() * 100).reset_index()
crime_rate['Outcome'] = crime_rate['WINNER'].map({0: 'Non-Winner', 1: 'Winner'})
fig3 = px.bar(
    crime_rate,
    x='Outcome',
    y='HAS_CRIMINAL_RECORD',
    title='Candidates with Criminal Cases by Outcome',
    labels={'HAS_CRIMINAL_RECORD': 'Rate (%)', 'Outcome': 'Winner Status'},
    text=crime_rate['HAS_CRIMINAL_RECORD'].round(1)
)
fig3.update_traces(texttemplate='%{text:.1f}%', textposition='outside')
fig3.update_yaxes(range=[0, 100])
fig3.show()

# -------- Plot 4: Asset Percentile Line Chart by Outcome --------
print('Plot 4: Asset Percentile Profile by Winner Status')
asset_df = df[['WINNER', 'ASSETS']].dropna().copy()
non_winner_assets = asset_df.loc[asset_df['WINNER'] == 0, 'ASSETS'].values
winner_assets = asset_df.loc[asset_df['WINNER'] == 1, 'ASSETS'].values

percentiles = np.arange(10, 101, 10)
non_winner_profile = np.percentile(non_winner_assets, percentiles)
winner_profile = np.percentile(winner_assets, percentiles)

asset_line_df = pd.DataFrame({
    'Percentile': list(percentiles) + list(percentiles),
    'Assets': list(non_winner_profile) + list(winner_profile),
    'Outcome': ['Non-Winner'] * len(percentiles) + ['Winner'] * len(percentiles)
})
fig4 = px.line(
    asset_line_df,
    x='Percentile',
    y='Assets',
    color='Outcome',
    markers=True,
    title='Declared Assets Across Percentiles by Outcome',
    labels={'Assets': 'Declared Assets (Crore)'}
)
fig4.show()

# -------- Plot 5: Top Parties by Candidates and Winners --------
print('Plot 5: Top Parties by Candidates and Winners')
top_parties = df['PARTY'].value_counts().head(10).index
party_candidates = df[df['PARTY'].isin(top_parties)].groupby('PARTY').size()
party_winners = df[(df['PARTY'].isin(top_parties)) & (df['WINNER'] == 1)].groupby('PARTY').size()
party_comp = pd.DataFrame({
    'Candidates': party_candidates,
    'Winners': party_winners
}).fillna(0).reset_index().rename(columns={'index': 'PARTY'})

party_long = party_comp.melt(id_vars='PARTY', var_name='Type', value_name='Count')
fig5 = px.bar(
    party_long,
    x='PARTY',
    y='Count',
    color='Type',
    barmode='group',
    title='Top 10 Parties: Candidate vs Winner Counts',
    labels={'PARTY': 'Party'}
)
fig5.update_xaxes(tickangle=45)
fig5.show()

# -------- Plot 6: Winner Candidate Caste/Category Distribution --------
print('Plot 6: Winner Candidate Caste/Category Distribution')
winner_category = (
    df.loc[df['WINNER'] == 1, 'CATEGORY']
    .replace('', np.nan)
    .fillna('Unknown')
    .astype(str)
    .str.strip()
)
category_counts = winner_category.value_counts().reset_index()
category_counts.columns = ['Category', 'Count']

fig6 = px.pie(
    category_counts,
    names='Category',
    values='Count',
    title='Winner Candidate Caste/Category Distribution'
)
# Pull out slices to create a broken/exploded pie effect.
pull_values = [0.12 if i == 0 else 0.04 for i in range(len(category_counts))]
fig6.update_traces(textinfo='percent+label', pull=pull_values)
fig6.show()

Section 3 Summary Statistics

Metric	Value
Total Rows (Authorship-like instances)	2263
Total Unique Candidates	2263
Total Winners	539
Win Rate (%)	23.82
Total Constituencies	542
Total States/UTs	36
Age Range	25 - 86
Median Age	52.0
Missing AGE (%)	10.83
Missing GENDER (%)	10.83
Missing EDUCATION (%)	10.83
Missing CATEGORY (%)	10.83


Plot 1: Top 15 States/UTs by Candidate Count

Plot 2: Gender Distribution by Winner Status

Plot 3: Criminal Record Rate by Winner Status

Plot 4: Asset Percentile Profile by Winner Status

Plot 5: Top Parties by Candidates and Winners

Plot 6: Winner Candidate Caste/Category Distribution (Exploded Pie)

3.1 Summary of Dataset

The prepared Lok Sabha candidate dataset (2019) contains 2263 candidate records, with 2263 unique candidates under the current deduplication logic. Out of these, 539 candidates are winners, giving an overall win rate of 23.82%. The data spans 542 constituencies across 36 states/UTs. Candidate age ranges from 25 to 86 years, with a median age of 52 years.

Data Completeness

After cleaning and feature engineering, missingness in the key socio-demographic fields is moderate and identical across the four columns, which suggests that the same rows lack all four fields (possibly non-candidate entries such as NOTA):

AGE: 10.83% missing
GENDER: 10.83% missing
EDUCATION: 10.83% missing
CATEGORY (caste category): 10.83% missing
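
An identical missing rate in four columns is worth verifying row by row. The sketch below shows one way to check whether the missing values co-occur; the toy rows (including the rows named NOTA) are hypothetical and only illustrate the check.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: three candidates with full demographics, two rows with none.
toy = pd.DataFrame({
    "NAME": ["c1", "c2", "NOTA", "c3", "NOTA"],
    "AGE": [45, 60, np.nan, 38, np.nan],
    "GENDER": ["MALE", "FEMALE", None, "MALE", None],
    "EDUCATION": ["Graduate", "Post Graduate", None, "12th Pass", None],
    "CATEGORY": ["GENERAL", "SC", None, "ST", None],
})

demo_cols = ["AGE", "GENDER", "EDUCATION", "CATEGORY"]
missing = toy[demo_cols].isna()

any_missing_pct = missing.any(axis=1).mean() * 100  # rows missing at least one field
all_missing_pct = missing.all(axis=1).mean() * 100  # rows missing every field

# True when every row with a gap is missing all four fields at once.
same_rows = bool((missing.any(axis=1) == missing.all(axis=1)).all())
print(any_missing_pct, all_missing_pct, same_rows)
```

If same_rows is True on the real data, the affected rows can be inspected directly (for example by NAME or PARTY) to decide whether they should be excluded from candidate-level comparisons.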

Overview Trends from Section 3 Plots

The visual summaries indicate several clear patterns:

Candidate volume is highest in large states such as Uttar Pradesh, Bihar, Tamil Nadu, West Bengal, and Maharashtra.
Gender composition is strongly male-dominated among both winners and non-winners.
The share of candidates with criminal records is higher among winners than among non-winners, as shown in the criminal-record rate plot.
Asset percentile profiles suggest that winner assets are generally higher across most central percentiles, while the extreme tail reflects a few very large outliers.
Party-level comparison shows high participation from major parties, with BJP and INC contributing the largest candidate pools; BJP also contributes the highest winner count in this snapshot.
The winner caste/category pie chart shows representation across multiple categories rather than a single-category concentration.

Overall, these summary trends establish a strong descriptive baseline for the hypothesis testing in the next section.

4. Hypotheses, Analysis and Findings

Based on the dataset and initial overview, we formulate hypotheses to investigate specific trends.

H1: Wealth and Win Probability

Hypothesis: Candidates with assets above 1 Crore have much higher win probability than candidates with assets at or below 1 Crore.

Analysis: Split candidates into two asset groups using the 1 Crore threshold, then compare win rates and compute the win-rate ratio.

Code

# H1 analysis: wealth threshold vs win probability
h1_df = df[['ASSETS', 'WINNER']].dropna().copy()
threshold = 1.0  # Crore

h1_df['AssetGroup'] = np.where(h1_df['ASSETS'] > threshold, 'Above 1 Crore', '1 Crore or Below')
h1_summary = h1_df.groupby('AssetGroup')['WINNER'].agg(['count', 'sum', 'mean']).reset_index()
h1_summary.columns = ['AssetGroup', 'Candidates', 'Winners', 'WinRate']
h1_summary['WinRatePercent'] = (h1_summary['WinRate'] * 100).round(2)

above_rate = h1_summary.loc[h1_summary['AssetGroup'] == 'Above 1 Crore', 'WinRate'].iloc[0]
below_rate = h1_summary.loc[h1_summary['AssetGroup'] == '1 Crore or Below', 'WinRate'].iloc[0]
win_rate_ratio = above_rate / below_rate if below_rate > 0 else np.nan

print('H1 Result Table')
display(h1_summary[['AssetGroup', 'Candidates', 'Winners', 'WinRatePercent']])
print(f"Win-rate ratio (Above 1 Crore / 1 Crore or Below): {win_rate_ratio:.2f}x")

H1 Result Table

	AssetGroup	Candidates	Winners	WinRatePercent
0	1 Crore or Below	884	147	16.63
1	Above 1 Crore	1081	391	36.17

Win-rate ratio (Above 1 Crore / 1 Crore or Below): 2.18x

Findings: Win rate is 16.63% for candidates with assets <= 1 Crore and 36.17% for candidates above 1 Crore; the win-rate ratio is 2.18x.

Conclusion: Wealth is positively associated with winning probability, supporting the hypothesis.
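
H1 reports the two win rates and their ratio without a significance test. As an illustrative addition rather than part of the original analysis, a two-proportion z-test can be run directly on the counts from the H1 result table. For a 2x2 table, the squared z statistic equals the chi-square statistic without continuity correction, so this is comparable to the chi-square tests used for later hypotheses.

```python
import math

# Counts copied from the H1 result table.
winners_above, n_above = 391, 1081   # assets above 1 Crore
winners_below, n_below = 147, 884    # assets at or below 1 Crore

rate_above = winners_above / n_above
rate_below = winners_below / n_below

# Pooled win rate under the null hypothesis of equal win probability.
pooled = (winners_above + winners_below) / (n_above + n_below)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_above + 1 / n_below))

z_score = (rate_above - rate_below) / std_err
# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z_score) / math.sqrt(2))

print(f"z = {z_score:.2f}, two-sided p = {p_value:.3g}")
```

A test of this kind addresses only whether the gap could plausibly arise by chance; it does not separate wealth from correlated factors such as party affiliation.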

H2: The Wealth Gap in Parties

Hypothesis: Candidates fielded by national parties (BJP/INC) have a statistically significantly higher median net worth than regional-party or independent candidates.

Analysis: Calculate median net worth (ASSETS - LIABILITIES) for national parties (BJP, INC) vs regional/independent candidates, then compare distributions using statistical testing.

Code

# H2 analysis: wealth gap in parties
h2_df = df[['PARTY', 'ASSETS', 'LIABILITIES']].dropna().copy()
h2_df['NetWorth'] = h2_df['ASSETS'] - h2_df['LIABILITIES']

# Define national vs regional/independent
national_parties = ['BJP', 'INC']
h2_df['PartyType'] = np.where(h2_df['PARTY'].isin(national_parties), 'National', 'Regional/Independent')

# Calculate median net worth by party type
h2_summary = h2_df.groupby('PartyType')['NetWorth'].agg(['count', 'median', 'mean']).reset_index()
h2_summary.columns = ['PartyType', 'Candidates', 'MedianNetWorth', 'MeanNetWorth']

national_networth = h2_df[h2_df['PartyType'] == 'National']['NetWorth']
regional_networth = h2_df[h2_df['PartyType'] == 'Regional/Independent']['NetWorth']

# Mann-Whitney U test for statistical significance
stat, p_value = mannwhitneyu(national_networth, regional_networth, alternative='two-sided')

print('H2 Result Table')
display(h2_summary)
print(f"\nMedian Net Worth Comparison:")
print(f"National Parties Median: {h2_summary.loc[h2_summary['PartyType'] == 'National', 'MedianNetWorth'].iloc[0]:.2f} Crore")
print(f"Regional/Independent Median: {h2_summary.loc[h2_summary['PartyType'] == 'Regional/Independent', 'MedianNetWorth'].iloc[0]:.2f} Crore")
print(f"\nMann-Whitney U Test p-value: {p_value:.6f}")
print(f"Significant difference (p < 0.05): {p_value < 0.05}")

H2 Result Table

	PartyType	Candidates	MedianNetWorth	MeanNetWorth
0	National	602	4.000	16.907542
1	Regional/Independent	684	2.185	13.812339


Median Net Worth Comparison:
National Parties Median: 4.00 Crore
Regional/Independent Median: 2.19 Crore

Mann-Whitney U Test p-value: 0.000001
Significant difference (p < 0.05): True

Findings: Among the 1,286 candidates with both assets and liabilities parsed, national-party candidates (BJP/INC) have a median net worth of 4.00 Crore compared with 2.19 Crore for regional/independent candidates, a ratio of about 1.83x. The Mann-Whitney U test yields a p-value of 0.000001, so the difference is statistically significant at the 0.05 level.

Conclusion: The hypothesis is supported. National parties field candidates with significantly higher net worth than regional and independent candidates, which is consistent with a resource advantage for national parties in fielding candidates.
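
A small p-value shows that the two distributions differ but not by how much. One common companion to the Mann-Whitney U test is the probability of superiority: the share of cross-group pairs in which the first group's value is larger, which equals U / (n1 * n2) for the first sample's U statistic. The sketch below computes it directly on toy values; the numbers and the function name prob_superiority are hypothetical.

```python
from itertools import product


def prob_superiority(group_a, group_b):
    """Share of (a, b) pairs with a > b, counting ties as half a win."""
    pairs = list(product(group_a, group_b))
    score = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in pairs)
    return score / len(pairs)


# Toy net-worth values in Crore, invented for illustration.
national_toy = [4.0, 6.0, 8.0]
regional_toy = [1.0, 5.0, 7.0]

effect = prob_superiority(national_toy, regional_toy)
print(f"P(national > regional) = {effect:.3f}")
```

A value of 0.5 would mean no systematic shift between the groups, while values closer to 1 mean the first group is larger in most pairings. On the real data this would be called with the national and regional net-worth series.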

H3: State-wise Wealth Concentration

Hypothesis: Candidates from Karnataka and Andhra Pradesh have a higher median asset value than candidates from Bihar and Uttar Pradesh.

Analysis: Compare median asset values for candidates from the two groups of states and perform a Mann-Whitney U test to assess statistical significance.

Code

# H3 analysis: state-wise wealth concentration
h3_df = df[['STATE', 'ASSETS']].dropna().copy()

# Define state groups
wealthy_states = ['Karnataka', 'Andhra Pradesh']
poor_states = ['Bihar', 'Uttar Pradesh']

h3_df['StateGroup'] = h3_df['STATE'].apply(
    lambda x: 'Wealthy (KA/AP)' if x in wealthy_states 
    else ('Poor (BR/UP)' if x in poor_states else 'Other')
)

# Filter to only the two groups for analysis
h3_comparison = h3_df[h3_df['StateGroup'] != 'Other'].copy()

# Calculate median assets by state group
h3_summary = h3_comparison.groupby('StateGroup')['ASSETS'].agg(['count', 'median', 'mean']).reset_index()
h3_summary.columns = ['StateGroup', 'Candidates', 'MedianAssets', 'MeanAssets']

wealthy_assets = h3_comparison[h3_comparison['StateGroup'] == 'Wealthy (KA/AP)']['ASSETS']
poor_assets = h3_comparison[h3_comparison['StateGroup'] == 'Poor (BR/UP)']['ASSETS']

# Mann-Whitney U test
from scipy.stats import mannwhitneyu
stat_h3, p_value_h3 = mannwhitneyu(wealthy_assets, poor_assets, alternative='two-sided')

print('H3 Result Table')
display(h3_summary)
print(f"\nMedian Asset Comparison:")
print(f"Wealthy States (KA/AP) Median: {h3_summary.loc[h3_summary['StateGroup'] == 'Wealthy (KA/AP)', 'MedianAssets'].iloc[0]:.2f} Crore")
print(f"Poor States (BR/UP) Median: {h3_summary.loc[h3_summary['StateGroup'] == 'Poor (BR/UP)', 'MedianAssets'].iloc[0]:.2f} Crore")
print(f"\nMann-Whitney U Test p-value: {p_value_h3:.6f}")
print(f"Significant difference (p < 0.05): {p_value_h3 < 0.05}")

# Visualization: Histogram (uses the two-group comparison frame built above)
fig_h3_hist = px.histogram(
    h3_comparison,
    x='ASSETS',
    color='StateGroup',
    barmode='overlay',
    nbins=30,
    title='Asset Distribution Histogram: Wealthy vs Poor States',
    labels={'ASSETS': 'Declared Assets (Crore)', 'StateGroup': 'State Group'},
    opacity=0.7
)
# Trim the x-axis to the 1st-99th percentile so a few extreme values do not flatten the plot.
fig_h3_hist.update_xaxes(range=[h3_comparison['ASSETS'].quantile(0.01), h3_comparison['ASSETS'].quantile(0.99)])
fig_h3_hist.show()

H3 Result Table

	StateGroup	Candidates	MedianAssets	MeanAssets
0	Poor (BR/UP)	455	2.0	11.182989
1	Wealthy (KA/AP)	172	8.5	30.685756


Median Asset Comparison:
Wealthy States (KA/AP) Median: 8.50 Crore
Poor States (BR/UP) Median: 2.00 Crore

Mann-Whitney U Test p-value: 0.000000
Significant difference (p < 0.05): True

Findings: Candidates from the wealthy-state group (Karnataka/Andhra Pradesh) have a median asset value of 8.50 Crore compared with 2.00 Crore for candidates from the poor-state group (Bihar/Uttar Pradesh), a ratio of 4.25x. The Mann-Whitney U test yields a p-value < 0.0001, so the difference is statistically significant at the 0.05 level.

Conclusion: The hypothesis is supported. Candidates from Karnataka and Andhra Pradesh declare substantially higher assets than those from Bihar and Uttar Pradesh. This pattern is consistent with state-level economic differences carrying over into candidate resources, although the comparison alone does not establish that state wealth causes the gap.

H4: Party Tolerance for Crime

Hypothesis: Major parties (BJP/INC) have a higher proportion of candidates with criminal records than regional parties and independents.

Analysis: Calculate the proportion of candidates with criminal records for major parties (BJP/INC) vs regional/independent candidates, then compare these proportions.

Code

# H4 analysis: party tolerance for crime
from scipy.stats import chi2_contingency

h4_df = df[['PARTY', 'CRIMINAL_CASES', 'HAS_CRIMINAL_RECORD']].dropna().copy()
major_parties = ['BJP', 'INC']

h4_df['PartyGroup'] = np.where(
    h4_df['PARTY'].isin(major_parties),
    'Major (BJP/INC)',
    'Regional/Independent'
)

# Proportion summary
h4_summary = h4_df.groupby('PartyGroup')['HAS_CRIMINAL_RECORD'].agg(['count', 'sum', 'mean']).reset_index()
h4_summary.columns = ['PartyGroup', 'Candidates', 'WithCriminalRecord', 'CriminalRate']
h4_summary['CriminalRatePercent'] = (h4_summary['CriminalRate'] * 100).round(2)

major_rate = h4_summary.loc[h4_summary['PartyGroup'] == 'Major (BJP/INC)', 'CriminalRate'].iloc[0]
other_rate = h4_summary.loc[h4_summary['PartyGroup'] == 'Regional/Independent', 'CriminalRate'].iloc[0]
rate_gap = (major_rate - other_rate) * 100

# Chi-square test for difference in proportions
h4_contingency = pd.crosstab(h4_df['PartyGroup'], h4_df['HAS_CRIMINAL_RECORD'])
chi2_h4, p_value_h4, _, _ = chi2_contingency(h4_contingency)

print('H4 Result Table')
display(h4_summary[['PartyGroup', 'Candidates', 'WithCriminalRecord', 'CriminalRatePercent']])
print(f"\nCriminal-record rate (Major): {major_rate * 100:.2f}%")
print(f"Criminal-record rate (Regional/Independent): {other_rate * 100:.2f}%")
print(f"Rate gap (Major - Regional/Independent): {rate_gap:.2f} percentage points")
print(f"\nChi-square test p-value: {p_value_h4:.6f}")
print(f"Significant difference (p < 0.05): {p_value_h4 < 0.05}")

# Visualization: direct group comparison
fig_h4_bar = px.bar(
    h4_summary,
    x='PartyGroup',
    y='CriminalRatePercent',
    text='CriminalRatePercent',
    title='H4: Criminal Record Rate by Party Group',
    labels={'CriminalRatePercent': 'Candidates with Criminal Record (%)', 'PartyGroup': 'Party Group'}
)
fig_h4_bar.update_traces(texttemplate='%{text:.2f}%', textposition='outside')
fig_h4_bar.update_yaxes(range=[0, max(100, h4_summary['CriminalRatePercent'].max() + 5)])
fig_h4_bar.show()

H4 Result Table

	PartyGroup	Candidates	WithCriminalRecord	CriminalRatePercent
0	Major (BJP/INC)	833	333	39.98
1	Regional/Independent	1430	421	29.44


Criminal-record rate (Major): 39.98%
Criminal-record rate (Regional/Independent): 29.44%
Rate gap (Major - Regional/Independent): 10.54 percentage points

Chi-square test p-value: 0.000000
Significant difference (p < 0.05): True

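Findings: Among major-party candidates (BJP/INC), 39.98% have criminal records (333 out of 833), compared with 29.44% of regional/independent candidates (421 out of 1430). The rate is 10.54 percentage points higher for major parties. A chi-square test gives a p-value < 0.0001, showing the difference is statistically significant. Because CRIMINAL_CASES was filled with 0 wherever it was missing or non-numeric (Section 2.2a), such rows count as having no record, so these rates may be understated if missing values are present in either group.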

Conclusion: The hypothesis is supported. In this dataset, major parties field a significantly higher proportion of candidates with criminal records than regional and independent candidates.
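
Since Section 2.2(a) replaces missing or non-numeric CRIMINAL_CASES values with 0, a simple sensitivity check is to compare the rate computed on filled data with the rate computed only on rows where a value was actually observed. The sketch below uses invented data; the column CRIMINAL_CASES_RAW is a hypothetical copy of the field taken before filling, which the original notebook does not keep.

```python
import numpy as np
import pandas as pd

# Invented example: two of the six "Other" rows have no declared value.
toy = pd.DataFrame({
    "PartyGroup": ["Major"] * 4 + ["Other"] * 6,
    "CRIMINAL_CASES_RAW": [2, 0, 1, 0, 1, 0, np.nan, np.nan, 0, 3],
})

# Rate when missing values are treated as zero cases (the notebook's approach).
toy["FILLED"] = (toy["CRIMINAL_CASES_RAW"].fillna(0) > 0).astype(int)
rate_filled = toy.groupby("PartyGroup")["FILLED"].mean() * 100

# Rate restricted to rows with an observed value.
observed = toy.dropna(subset=["CRIMINAL_CASES_RAW"]).copy()
observed["HAS_RECORD"] = (observed["CRIMINAL_CASES_RAW"] > 0).astype(int)
rate_observed = observed.groupby("PartyGroup")["HAS_RECORD"].mean() * 100

print(rate_filled.round(2).to_dict())
print(rate_observed.round(2).to_dict())
```

In this toy case the apparent gap between the groups comes entirely from the filled rows. If the two versions differ noticeably on the real data, the observed-only rates are the safer basis for the comparison.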

H5: Criminal Record and Candidate Wealth

Hypothesis: Candidates with at least one criminal case have higher declared assets than candidates with no criminal cases.

Analysis: Split candidates into two groups using HAS_CRIMINAL_RECORD, compare median assets, run a Mann-Whitney U test for significance, and visualize the distribution difference with a percentile line plot.

Code

# H5 analysis: criminal record vs candidate wealth
h5_df = df[['HAS_CRIMINAL_RECORD', 'ASSETS']].dropna().copy()

h5_df['CrimeGroup'] = np.where(
    h5_df['HAS_CRIMINAL_RECORD'] == 1,
    'Has Criminal Record',
    'No Criminal Record'
)

# Summary statistics by group
h5_summary = h5_df.groupby('CrimeGroup')['ASSETS'].agg(['count', 'median', 'mean']).reset_index()
h5_summary.columns = ['CrimeGroup', 'Candidates', 'MedianAssets', 'MeanAssets']

assets_crime = h5_df[h5_df['CrimeGroup'] == 'Has Criminal Record']['ASSETS']
assets_no_crime = h5_df[h5_df['CrimeGroup'] == 'No Criminal Record']['ASSETS']

# Mann-Whitney U test
stat_h5, p_value_h5 = mannwhitneyu(assets_crime, assets_no_crime, alternative='two-sided')

print('H5 Result Table')
display(h5_summary)
print('\nMedian Asset Comparison:')
print(f"Has Criminal Record Median: {h5_summary.loc[h5_summary['CrimeGroup'] == 'Has Criminal Record', 'MedianAssets'].iloc[0]:.2f} Crore")
print(f"No Criminal Record Median: {h5_summary.loc[h5_summary['CrimeGroup'] == 'No Criminal Record', 'MedianAssets'].iloc[0]:.2f} Crore")
print(f"\nMann-Whitney U Test p-value: {p_value_h5:.6f}")
print(f"Significant difference (p < 0.05): {p_value_h5 < 0.05}")

# Visualization: Percentile line plot
print('\nH5 Distribution Comparison - Line Plot (Percentiles)')
percentiles_h5 = np.arange(10, 101, 10)
crime_percentiles = np.percentile(assets_crime.dropna(), percentiles_h5)
no_crime_percentiles = np.percentile(assets_no_crime.dropna(), percentiles_h5)

h5_line_df = pd.DataFrame({
    'Percentile': list(percentiles_h5) + list(percentiles_h5),
    'Assets': list(crime_percentiles) + list(no_crime_percentiles),
    'CrimeGroup': ['Has Criminal Record'] * len(percentiles_h5) + ['No Criminal Record'] * len(percentiles_h5)
})

fig_h5_line = px.line(
    h5_line_df,
    x='Percentile',
    y='Assets',
    color='CrimeGroup',
    markers=True,
    title='Asset Percentile Profile: Criminal Record vs No Criminal Record',
    labels={'Assets': 'Declared Assets (Crore)', 'CrimeGroup': 'Candidate Group'}
)
fig_h5_line.show()

H5 Result Table

	CrimeGroup	Candidates	MedianAssets	MeanAssets
0	Has Criminal Record	746	3.0	15.187882
1	No Criminal Record	1219	2.0	11.945939


Median Asset Comparison:
Has Criminal Record Median: 3.00 Crore
No Criminal Record Median: 2.00 Crore

Mann-Whitney U Test p-value: 0.000023
Significant difference (p < 0.05): True

H5 Distribution Comparison - Line Plot (Percentiles)

Findings: Candidates with criminal records have a median declared asset value of 3.00 Crore, while candidates without criminal records have a median of 2.00 Crore (a 1.50x gap). The Mann-Whitney U test gives a p-value of 0.000023, indicating a statistically significant difference (p < 0.05). The percentile line plot also shows the criminal-record group shifted toward higher asset levels.

Conclusion: The hypothesis is supported. In this dataset, candidates with criminal records tend to be wealthier than those without criminal records, suggesting wealth and criminal-background presence are positively associated at the candidate level.
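
The median gap reported for H5 is a single point estimate. One way to express its uncertainty, offered here as a sketch rather than as part of the original analysis, is a percentile bootstrap interval for the difference in medians. The function name bootstrap_median_gap, its defaults, and the toy asset values are all assumptions made for illustration.

```python
import numpy as np


def bootstrap_median_gap(group_a, group_b, n_boot=2000, seed=0):
    """Percentile bootstrap 95% interval for median(group_a) - median(group_b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choice(a, size=a.size, replace=True)
        resample_b = rng.choice(b, size=b.size, replace=True)
        gaps[i] = np.median(resample_a) - np.median(resample_b)
    low, high = np.percentile(gaps, [2.5, 97.5])
    return float(low), float(high)


# Toy declared assets in Crore, invented for illustration.
with_record_toy = [3.0, 4.5, 5.0, 6.2, 8.0, 12.0]
without_record_toy = [0.4, 0.9, 1.5, 2.0, 2.4, 2.9]

ci_low, ci_high = bootstrap_median_gap(with_record_toy, without_record_toy)
print(f"95% bootstrap interval for the median gap: [{ci_low:.2f}, {ci_high:.2f}] Crore")
```

On the real data the two arguments would be the asset series for each group. An interval that stays above zero would point the same way as the Mann-Whitney result, while also indicating the plausible size of the gap.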

H6: Independent Candidates and Deposit Loss

Hypothesis: Independent candidates are significantly more likely to lose their deposit than candidates from recognized parties.

Analysis: Compare deposit-loss rates between independent and party-affiliated candidates using LOST_DEPOSIT, test significance via chi-square test, and visualize the rate gap.

Code

# H6 analysis: independent candidates vs deposit loss
from scipy.stats import chi2_contingency

h6_df = df[['IS_INDEPENDENT', 'LOST_DEPOSIT', 'PARTY']].dropna().copy()
h6_df['CandidateType'] = np.where(h6_df['IS_INDEPENDENT'] == 1, 'Independent', 'Party-Affiliated')

# Summary statistics
h6_summary = h6_df.groupby('CandidateType')['LOST_DEPOSIT'].agg(['count', 'sum', 'mean']).reset_index()
h6_summary.columns = ['CandidateType', 'Candidates', 'LostDepositCount', 'LostDepositRate']
h6_summary['LostDepositRatePercent'] = (h6_summary['LostDepositRate'] * 100).round(2)

ind_rate = h6_summary.loc[h6_summary['CandidateType'] == 'Independent', 'LostDepositRate'].iloc[0]
party_rate = h6_summary.loc[h6_summary['CandidateType'] == 'Party-Affiliated', 'LostDepositRate'].iloc[0]
rate_gap_h6 = (ind_rate - party_rate) * 100

# Chi-square test of proportion difference
h6_contingency = pd.crosstab(h6_df['CandidateType'], h6_df['LOST_DEPOSIT'])
chi2_h6, p_value_h6, _, _ = chi2_contingency(h6_contingency)

print('H6 Result Table')
display(h6_summary[['CandidateType', 'Candidates', 'LostDepositCount', 'LostDepositRatePercent']])
print(f"\nDeposit-loss rate (Independent): {ind_rate * 100:.2f}%")
print(f"Deposit-loss rate (Party-Affiliated): {party_rate * 100:.2f}%")
print(f"Rate gap (Independent - Party-Affiliated): {rate_gap_h6:.2f} percentage points")
print(f"\nChi-square test p-value: {p_value_h6:.6f}")
print(f"Significant difference (p < 0.05): {p_value_h6 < 0.05}")

# Visualization: Deposit-loss rate by candidate type
fig_h6_bar = px.bar(
    h6_summary,
    x='CandidateType',
    y='LostDepositRatePercent',
    text='LostDepositRatePercent',
    title='H6: Deposit-Loss Rate by Candidate Type',
    labels={'LostDepositRatePercent': 'Deposit-Loss Rate (%)', 'CandidateType': 'Candidate Type'}
)
fig_h6_bar.update_traces(texttemplate='%{text:.2f}%', textposition='outside')
fig_h6_bar.update_yaxes(range=[0, max(100, h6_summary['LostDepositRatePercent'].max() + 5)])
fig_h6_bar.show()

H6 Result Table

	CandidateType	Candidates	LostDepositCount	LostDepositRatePercent
0	Independent	201	189	94.03
1	Party-Affiliated	2062	943	45.73


Deposit-loss rate (Independent): 94.03%
Deposit-loss rate (Party-Affiliated): 45.73%
Rate gap (Independent - Party-Affiliated): 48.30 percentage points

Chi-square test p-value: 0.000000
Significant difference (p < 0.05): True

Findings: Independent candidates have a deposit-loss rate of 94.03% (189/201), while party-affiliated candidates have 45.73% (943/2062). The gap is 48.30 percentage points. The chi-square test gives p-value < 0.0001, showing the difference is highly statistically significant.

Conclusion: The hypothesis is strongly supported. Independent candidates are far more likely to lose their deposit than party-affiliated candidates in this election dataset.
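
The 48.30-point gap can be complemented with an interval estimate and ratio-based effect sizes. The sketch below recomputes these from the counts in the H6 result table alone, so it does not need the dataset. The Wald interval used here is an assumption of convenience; it can be imprecise when a proportion is close to 0 or 1, as the independent rate is, so a Wilson or Newcombe interval may be preferable in a final analysis.

```python
# Sketch: effect sizes for the H6 deposit-loss gap, from the reported counts.
import math

# Counts copied from the H6 result table
ind_lost, ind_total = 189, 201
party_lost, party_total = 943, 2062

p_ind = ind_lost / ind_total
p_party = party_lost / party_total

# Difference in proportions with a Wald 95% confidence interval
gap = p_ind - p_party
se_gap = math.sqrt(
    p_ind * (1 - p_ind) / ind_total + p_party * (1 - p_party) / party_total
)
z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval
gap_ci = (gap - z_95 * se_gap, gap + z_95 * se_gap)

# Risk ratio and odds ratio of losing the deposit
# (Independent relative to Party-Affiliated)
risk_ratio = p_ind / p_party
odds_ratio = (ind_lost / (ind_total - ind_lost)) / (
    party_lost / (party_total - party_lost)
)

print(f"Gap: {gap * 100:.2f} pp "
      f"(95% CI {gap_ci[0] * 100:.2f} to {gap_ci[1] * 100:.2f} pp)")
print(f"Risk ratio: {risk_ratio:.2f}")
print(f"Odds ratio: {odds_ratio:.2f}")
```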

H7: The Youth Barrier

Hypothesis: Candidates under the age of 35 have the highest “deposit forfeiture” rate (receiving less than one-sixth of the valid votes polled) of any age group.

Analysis: Group candidates by age categories (for example, <35, 35-50, >50), calculate deposit-loss rates for each group, and compare to identify if the under-35 group has the highest rate.

Code

# H7 analysis: youth barrier and deposit forfeiture
from scipy.stats import chi2_contingency

h7_df = df[['AGE', 'LOST_DEPOSIT']].dropna().copy()
h7_df['AGE'] = pd.to_numeric(h7_df['AGE'], errors='coerce')
h7_df = h7_df.dropna().copy()

# Age groups: <35, 35-50, >50
h7_df['AgeGroup'] = pd.cut(
    h7_df['AGE'],
    bins=[-np.inf, 34, 50, np.inf],
    labels=['Under 35', '35-50', 'Above 50']
)

# Summary stats by age group
h7_summary = h7_df.groupby('AgeGroup', observed=True)['LOST_DEPOSIT'].agg(['count', 'sum', 'mean']).reset_index()
h7_summary.columns = ['AgeGroup', 'Candidates', 'LostDepositCount', 'LostDepositRate']
h7_summary['LostDepositRatePercent'] = (h7_summary['LostDepositRate'] * 100).round(2)

# Identify highest deposit-loss group
highest_group_row = h7_summary.loc[h7_summary['LostDepositRatePercent'].idxmax()]
highest_group = highest_group_row['AgeGroup']
highest_rate = highest_group_row['LostDepositRatePercent']

# Chi-square test for dependence between age-group and deposit loss
h7_contingency = pd.crosstab(h7_df['AgeGroup'], h7_df['LOST_DEPOSIT'])
chi2_h7, p_value_h7, _, _ = chi2_contingency(h7_contingency)

print('H7 Result Table')
display(h7_summary[['AgeGroup', 'Candidates', 'LostDepositCount', 'LostDepositRatePercent']])
print(f"\nHighest deposit-loss group: {highest_group} ({highest_rate:.2f}%)")
print(f"Chi-square test p-value: {p_value_h7:.6f}")
print(f"Significant age-group difference (p < 0.05): {p_value_h7 < 0.05}")

# Visualization: Line chart with top-left percentage note
h7_long = h7_summary.melt(
    id_vars='AgeGroup',
    value_vars=['Candidates', 'LostDepositCount'],
    var_name='Type',
    value_name='Count'
)
fig_h7_line = px.line(
    h7_long,
    x='AgeGroup',
    y='Count',
    color='Type',
    markers=True,
    title='H7: Candidate Count vs Deposit-Loss Count by Age Group',
    labels={'AgeGroup': 'Age Group'}
)
fig_h7_line.add_annotation(
    xref='paper',
    yref='paper',
    x=0.01,
    y=0.98,
    text=f"Highest loss rate: {highest_group} ({highest_rate:.2f}%)",
    showarrow=False,
    align='left',
    bgcolor='rgba(255,255,255,0.8)',
    bordercolor='gray',
    borderwidth=1
)
fig_h7_line.show()

H7 Result Table

	AgeGroup	Candidates	LostDepositCount	LostDepositRatePercent
0	Under 35	157	108	68.79
1	35-50	738	386	52.30
2	Above 50	1123	393	35.00


Highest deposit-loss group: Under 35 (68.79%)
Chi-square test p-value: 0.000000
Significant age-group difference (p < 0.05): True

Findings: The under-35 group has the highest deposit-loss rate at 68.79% (108/157), compared with 52.30% (386/738) for ages 35-50 and 35.00% (393/1123) for above 50. The chi-square test gives p-value < 0.0001, confirming that deposit-loss rates differ significantly across age groups.

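Conclusion: The hypothesis is strongly supported. Candidates under 35 have the highest deposit-forfeiture rate of the three age groups, which is consistent with a youth barrier in electoral competitiveness, although this comparison does not control for party affiliation or other candidate attributes.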
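
The chi-square test above shows that deposit-loss rates differ across the three age groups, but it does not test directly whether the under-35 rate exceeds each of the others. One possible follow-up, sketched here from the counts in the H7 result table, is a pair of two-proportion z-tests with a Bonferroni adjustment. The helper name two_proportion_z and the choice of Bonferroni are assumptions made for this illustration, not part of the original analysis.

```python
# Sketch: pairwise follow-up to the H7 chi-square test, from the reported counts.
import math

# (lost deposits, candidates) copied from the H7 result table
groups = {
    "Under 35": (108, 157),
    "35-50": (386, 738),
    "Above 50": (393, 1123),
}

def two_proportion_z(lost_a, n_a, lost_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = lost_a / n_a, lost_b / n_b
    pooled = (lost_a + lost_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

reference = "Under 35"
others = [g for g in groups if g != reference]
results = {}
for other in others:
    z, p = two_proportion_z(*groups[reference], *groups[other])
    # Bonferroni: multiply by the number of comparisons, capped at 1
    p_adj = min(1.0, p * len(others))
    results[other] = (z, p_adj)
    print(f"{reference} vs {other}: z = {z:.2f}, adjusted p = {p_adj:.2e}")
```

A positive z with a small adjusted p-value in both comparisons would support the directional claim that the under-35 group has the highest rate, rather than only the weaker claim that the groups differ.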

H8: The Party Effect

Hypothesis: In Gujarat and Rajasthan, party affiliation is a stronger predictor of winning than candidate education or wealth.

Analysis: Restrict data to Gujarat and Rajasthan, compute association strength between WINNER and each factor (PARTY, EDUCATION, and wealth quartile) using chi-square and Cramer’s V, then compare effect sizes.

Code

# H8 analysis: party effect vs education and wealth in Gujarat/Rajasthan
from scipy.stats import chi2_contingency

h8_df = df[['STATE', 'PARTY', 'EDUCATION', 'ASSETS', 'WINNER']].dropna().copy()
h8_df = h8_df[h8_df['STATE'].isin(['Gujarat', 'Rajasthan'])].copy()

# Standardize categories
h8_df['PARTY'] = h8_df['PARTY'].astype(str).str.strip()
h8_df['EDUCATION'] = h8_df['EDUCATION'].astype(str).str.strip()

# Wealth buckets (quartiles)
h8_df['AssetQuartile'] = pd.qcut(h8_df['ASSETS'], q=4, labels=['Q1 (Low)', 'Q2', 'Q3', 'Q4 (High)'])

# Keep parties with enough candidates for stability
party_counts = h8_df['PARTY'].value_counts()
major_state_parties = party_counts[party_counts >= 5].index
h8_party_df = h8_df[h8_df['PARTY'].isin(major_state_parties)].copy()

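# Helper: Cramer's V from a contingency table.
# Note: chi2_contingency applies Yates' continuity correction by default
# when the table has one degree of freedom (2x2), which lowers chi-square there.
def cramers_v_from_ct(ct):
    chi2, p, _, _ = chi2_contingency(ct)
    n = ct.to_numpy().sum()
    r, c = ct.shape
    # V = sqrt(chi2 / (n * min(r - 1, c - 1))); max(..., 1) guards a one-row/one-column table
    denom = n * max(min(r - 1, c - 1), 1)
    v = np.sqrt(chi2 / denom) if denom > 0 else np.nan
    return chi2, p, v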

# Contingency tables vs WINNER
ct_party = pd.crosstab(h8_party_df['PARTY'], h8_party_df['WINNER'])
ct_edu = pd.crosstab(h8_df['EDUCATION'], h8_df['WINNER'])
ct_asset = pd.crosstab(h8_df['AssetQuartile'], h8_df['WINNER'])

chi2_party, p_party, v_party = cramers_v_from_ct(ct_party)
chi2_edu, p_edu, v_edu = cramers_v_from_ct(ct_edu)
chi2_asset, p_asset, v_asset = cramers_v_from_ct(ct_asset)

h8_effect = pd.DataFrame({
    'Factor': ['Party', 'Education', 'Wealth (Asset Quartile)'],
    'ChiSquare': [chi2_party, chi2_edu, chi2_asset],
    'PValue': [p_party, p_edu, p_asset],
    'CramersV': [v_party, v_edu, v_asset]
})

print('H8 Effect-Size Comparison (Gujarat + Rajasthan)')
display(h8_effect)

# Also show win-rate spread by factor for interpretability
party_win = (h8_party_df.groupby('PARTY')['WINNER'].mean() * 100).sort_values(ascending=False).reset_index(name='WinRatePercent')
edu_win = (h8_df.groupby('EDUCATION')['WINNER'].mean() * 100).sort_values(ascending=False).reset_index(name='WinRatePercent')
asset_win = (h8_df.groupby('AssetQuartile', observed=True)['WINNER'].mean() * 100).reset_index(name='WinRatePercent')

# Visualization 1: Cramer's V comparison
fig_h8_bar = px.bar(
    h8_effect,
    x='Factor',
    y='CramersV',
    text=h8_effect['CramersV'].round(3),
    title='H8: Association Strength with Winning (Cramer\'s V)',
    labels={'CramersV': 'Cramer\'s V (Higher = Stronger Association)'}
)
fig_h8_bar.update_traces(textposition='outside')
fig_h8_bar.show()

# Visualization 2: Party win-rate line (top parties)
fig_h8_line = px.line(
    party_win.head(10),
    x='PARTY',
    y='WinRatePercent',
    markers=True,
    title='H8: Win Rate by Party (Top 10 in Gujarat + Rajasthan)',
    labels={'PARTY': 'Party', 'WinRatePercent': 'Win Rate (%)'}
)
fig_h8_line.update_xaxes(tickangle=45)
fig_h8_line.show()

H8 Effect-Size Comparison (Gujarat + Rajasthan)

	Factor	ChiSquare	PValue	CramersV
0	Party	127.000000	2.396197e-27	1.000000
1	Education	22.010798	8.844818e-03	0.397933
2	Wealth (Asset Quartile)	11.514143	9.247093e-03	0.287812

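Findings: In Gujarat and Rajasthan, Party shows the strongest association with winning (Chi-square = 127.00, p-value < 0.0001, Cramer’s V = 1.000), ahead of Education (Cramer’s V = 0.398, p-value = 0.0088) and Wealth quartile (Cramer’s V = 0.288, p-value = 0.0092). A Cramer’s V of exactly 1.000 means that, among parties with at least 5 candidates in this filtered sample, the party label separates winners from losers perfectly; the party-wise win-rate chart shows the same concentration (for example, BJP at 100%). The Party statistic is computed only on candidates from those larger parties, whereas Education and Wealth use all Gujarat and Rajasthan candidates (about 127 vs 139 candidates, as implied by the reported statistics), so the three effect sizes are not computed on identical samples.

Conclusion: The hypothesis is supported. In these two states, party affiliation is a stronger predictor of winning than candidate education or wealth. This result should be read with caution: the filtered sample is small, some party and education categories are sparse, and Cramer’s V tends to be inflated in small samples with many categories.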
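
To make the Cramer's V comparison easier to interpret, the sketch below shows two things under stated assumptions. First, a hypothetical party table with perfect separation (every party's candidates either all win or all lose) yields chi-square equal to n and V equal to 1, which is the pattern reported for Party. Second, it applies the bias-corrected Cramer's V of Bergsma (2013) to the reported Education and Wealth statistics. The toy table is invented for illustration, and n = 139 and the table shapes (10 x 2 and 4 x 2) are inferred from the reported chi-square, p-value, and V rather than read from the data.

```python
# Sketch: why Cramer's V reaches 1.0, plus a bias-corrected variant.
import math

def chi_square(table):
    """Pearson chi-square (no continuity correction) for a list-of-rows table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(chi2, n, r, c):
    """Plain Cramer's V for an r x c table."""
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))

def cramers_v_corrected(chi2, n, r, c):
    """Bias-corrected Cramer's V (Bergsma, 2013)."""
    phi2 = max(0.0, chi2 / n - (r - 1) * (c - 1) / (n - 1))
    r_adj = r - (r - 1) ** 2 / (n - 1)
    c_adj = c - (c - 1) ** 2 / (n - 1)
    return math.sqrt(phi2 / min(r_adj - 1, c_adj - 1))

# HYPOTHETICAL table (rows = parties, columns = [lost, won]) in which every
# party's candidates either all win or all lose: perfect separation.
toy_table = [[0, 50], [52, 0], [25, 0]]
toy_chi2, toy_n = chi_square(toy_table)
toy_v = cramers_v(toy_chi2, toy_n, r=3, c=2)
print(f"Toy table: chi2 = {toy_chi2:.2f}, n = {toy_n}, V = {toy_v:.3f}")

# Reported H8 statistics. n = 139 and the table shapes are ASSUMPTIONS
# inferred from the reported chi-square, p-value, and V, not read from the data.
reported = {
    "Education": (22.010798, 139, 10, 2),
    "Wealth (Asset Quartile)": (11.514143, 139, 4, 2),
}
corrected = {}
for factor, (chi2, n, r, c) in reported.items():
    corrected[factor] = cramers_v_corrected(chi2, n, r, c)
    print(f"{factor}: plain V = {cramers_v(chi2, n, r, c):.3f}, "
          f"corrected V = {corrected[factor]:.3f}")
```

If the inferred shapes are right, the corrected values come out lower than the plain ones, with the larger reduction for Education because it has more categories. The ordering Party > Education > Wealth is unchanged under these assumptions.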

H9: Caste Category and Win Probability

Hypothesis: Candidate caste category (CATEGORY) is significantly associated with winning probability, with measurable differences in win rates across categories.

Analysis: Clean and group caste categories, compare category-wise win rates, and test association between CATEGORY and WINNER using chi-square and Cramer’s V.

Code

# H9 analysis: caste category and winning probability
from scipy.stats import chi2_contingency

h9_df = df[['CATEGORY', 'WINNER']].copy()

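# Clean category labels: fill missing values first, then strip whitespace so that
# blank or whitespace-only entries are also mapped to 'UNKNOWN'
h9_df['CATEGORY'] = (
    h9_df['CATEGORY']
    .fillna('Unknown')
    .astype(str)
    .str.strip()
    .replace('', 'Unknown')
    .str.upper()
)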

# Keep categories with enough support for stable comparison
cat_counts = h9_df['CATEGORY'].value_counts()
valid_categories = cat_counts[cat_counts >= 20].index
h9_df = h9_df[h9_df['CATEGORY'].isin(valid_categories)].copy()

# Category-wise win rates
h9_summary = h9_df.groupby('CATEGORY')['WINNER'].agg(['count', 'sum', 'mean']).reset_index()
h9_summary.columns = ['Category', 'Candidates', 'Winners', 'WinRate']
h9_summary['WinRatePercent'] = (h9_summary['WinRate'] * 100).round(2)
h9_summary = h9_summary.sort_values('WinRatePercent', ascending=False).reset_index(drop=True)

# Chi-square + Cramer's V
h9_contingency = pd.crosstab(h9_df['CATEGORY'], h9_df['WINNER'])
chi2_h9, p_value_h9, _, _ = chi2_contingency(h9_contingency)

n_h9 = h9_contingency.to_numpy().sum()
r_h9, c_h9 = h9_contingency.shape
denom_h9 = n_h9 * max(min(r_h9 - 1, c_h9 - 1), 1)
cramers_v_h9 = np.sqrt(chi2_h9 / denom_h9) if denom_h9 > 0 else np.nan

top_cat = h9_summary.iloc[0]
bottom_cat = h9_summary.iloc[-1]

print('H9 Result Table')
display(h9_summary[['Category', 'Candidates', 'Winners', 'WinRatePercent']])
print(f"\nTop category by win rate: {top_cat['Category']} ({top_cat['WinRatePercent']:.2f}%)")
print(f"Lowest category by win rate: {bottom_cat['Category']} ({bottom_cat['WinRatePercent']:.2f}%)")
print(f"\nChi-square p-value: {p_value_h9:.6f}")
print(f"Cramer's V: {cramers_v_h9:.3f}")
print(f"Significant association (p < 0.05): {p_value_h9 < 0.05}")

# Visualization: candidate count vs winner count by category
h9_long = h9_summary.melt(
    id_vars='Category',
    value_vars=['Candidates', 'Winners'],
    var_name='Type',
    value_name='Count'
)
fig_h9_line = px.line(
    h9_long,
    x='Category',
    y='Count',
    color='Type',
    markers=True,
    title='H9: Candidate Count vs Winner Count by Caste Category',
    labels={'Category': 'Caste Category'}
)
fig_h9_line.update_layout(
    annotations=[dict(
        xref='paper', yref='paper', x=0.01, y=0.98,
        text=f"Cramer's V = {cramers_v_h9:.3f}, p < 0.05: {p_value_h9 < 0.05}",
        showarrow=False, bgcolor='rgba(255,255,255,0.8)', bordercolor='gray', borderwidth=1
    )]
)
fig_h9_line.show()

H9 Result Table

	Category	Candidates	Winners	WinRatePercent
0	GENERAL	1392	399	28.66
1	ST	243	55	22.63
2	SC	383	85	22.19
3	UNKNOWN	245	0	0.00


Top category by win rate: GENERAL (28.66%)
Lowest category by win rate: UNKNOWN (0.00%)

Chi-square p-value: 0.000000
Cramer's V: 0.205
Significant association (p < 0.05): True

Findings: Win rates differ across caste categories: GENERAL = 28.66% (399/1392), ST = 22.63% (55/243), and SC = 22.19% (85/383). The UNKNOWN group (245 candidates with a missing or blank CATEGORY, labeled during cleaning) has 0 winners. The chi-square test gives p-value < 0.0001 and Cramer’s V = 0.205, indicating a statistically significant but weak-to-moderate association between caste category and winning outcome. Because the UNKNOWN group has no winners, it accounts for most of the chi-square statistic, so the association among GENERAL, ST, and SC alone is weaker than V = 0.205 suggests.

Conclusion: The hypothesis is supported. Caste category is associated with winning probability in this dataset. However, the effect size is modest rather than dominant, and part of it reflects the UNKNOWN group, which is a data-quality artifact rather than a caste category.
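
A simple sensitivity check for the UNKNOWN group is to recompute the test with and without it. The sketch below does this from the counts in the H9 result table, using a hand-written Pearson chi-square so that it runs without the dataset. The helper name chi2_and_v is hypothetical, and the closed-form p-value applies only because the reduced table has 2 degrees of freedom.

```python
# Sketch: H9 sensitivity check, recomputing Cramer's V without UNKNOWN.
import math

# (winners, candidates) copied from the H9 result table
counts = {
    "GENERAL": (399, 1392),
    "ST": (55, 243),
    "SC": (85, 383),
    "UNKNOWN": (0, 245),
}

def chi2_and_v(groups):
    """Pearson chi-square and Cramer's V for a categories x {won, lost} table."""
    table = [(won, total - won) for won, total in groups.values()]
    n = sum(won + lost for won, lost in table)
    col_totals = [sum(row[j] for row in table) for j in range(2)]
    chi2 = 0.0
    for won, lost in table:
        row_total = won + lost
        for observed, col_total in zip((won, lost), col_totals):
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected
    # With two outcome columns, min(r - 1, c - 1) = 1, so V = sqrt(chi2 / n)
    return chi2, math.sqrt(chi2 / n)

chi2_all, v_all = chi2_and_v(counts)

known = {k: v for k, v in counts.items() if k != "UNKNOWN"}
chi2_known, v_known = chi2_and_v(known)
# For 2 degrees of freedom (3 categories x 2 outcomes), the chi-square
# survival function has the closed form exp(-chi2 / 2)
p_known = math.exp(-chi2_known / 2)

print(f"All categories:    chi2 = {chi2_all:.2f}, V = {v_all:.3f}")
print(f"Excluding UNKNOWN: chi2 = {chi2_known:.2f}, V = {v_known:.3f}, "
      f"p = {p_known:.4f}")
```

The first line should reproduce the reported V of about 0.205. The second line indicates how much of that association remains among GENERAL, ST, and SC candidates once the UNKNOWN rows are set aside.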

5. Discussion

This analysis of the 2019 Lok Sabha candidate dataset shows several clear patterns.

Wealth Matters, but Not Uniformly: Higher assets are associated with winning overall (H1), and national parties also field wealthier candidates than regional/independent ones (H2). However, wealth does not meaningfully explain margin of victory among winners (H3).

Party and Structure Are Strong Predictors: Party affiliation is one of the strongest predictors of winning in state-level slices such as Gujarat and Rajasthan (H8). Independent candidates also face a very high risk of losing their deposit (H6), showing that party label and organizational support matter a great deal.

Criminal Background and Age Patterns: Candidates with criminal cases tend to be wealthier (H5), major parties field a higher share of candidates with criminal records (H4), and candidates under 35 have the highest deposit-loss rate (H7).

Caste Effects: Caste category is associated with winning probability, but the effect is moderate rather than dominant (H9). General-category candidates win at a somewhat higher rate than SC/ST candidates in this dataset.

Limitations

The analysis is based on one election year, so patterns may not generalize to other elections.
Some variables contain missing or imputed values, especially category and education fields.
The tests used here measure association only; they do not establish causation.
State- and party-level samples can be sparse in some subgroups, so those results should be interpreted cautiously.

Overall, the data suggest that party strength, candidate resources, and structural disadvantages matter more consistently than any single personal trait on its own.

6. Conclusion

This report shows that election outcomes are shaped by a combination of party strength, candidate resources, and structural factors. Wealth is associated with a higher win probability, but party affiliation, including the contrast between party-backed and independent candidates, is often a stronger signal than individual traits alone. Caste and criminal-record patterns are statistically significant, though their effect sizes vary across the dataset. Overall, the analysis suggests that electoral success is more closely tied to party and structural advantage than to any single candidate characteristic.