📊 Sampling and Descriptive Statistics#

Understanding how to summarize and interpret data is foundational in statistics. This module introduces key concepts in sampling, population vs. sample statistics, and types of averages.


🎯 Population vs. Sample Statistics#

| Concept | Population (Parameter) | Sample (Statistic) |
|---|---|---|
| Mean | \( \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \) | \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \) |
| Variance | \( \sigma^2 = \frac{1}{N} \sum (x_i - \mu)^2 \) | \( s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 \) |
| Standard Deviation | \( \sigma \) | \( s \) |
| Proportion | \( P \) | \( \hat{p} \) |

  • Population: Entire group of interest

  • Sample: Subset used to estimate population characteristics (the sketch below shows the computational difference)
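
A minimal NumPy sketch (with hypothetical values) showing how the `ddof` argument switches between the population and sample formulas:

```python
import numpy as np

x = np.array([12.0, 15.0, 9.0, 14.0, 10.0])  # hypothetical measurements

# Population formulas divide by N (ddof=0); sample formulas divide by n-1 (ddof=1)
print(np.var(x, ddof=0))  # population variance
print(np.var(x, ddof=1))  # sample variance (unbiased estimator)
```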


🧪 Sampling Methods#

  • Simple Random Sampling: Every member has equal chance

  • Stratified Sampling: Divide into subgroups and sample proportionally

  • Systematic Sampling: Select every \(k^{th}\) item

  • Cluster Sampling: Randomly select entire groups (all four schemes are sketched below)
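
A minimal pandas/NumPy sketch of the four schemes, using a hypothetical population of 100 units split into two strata:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pop = pd.DataFrame({"id": np.arange(100), "stratum": ["A"] * 60 + ["B"] * 40})

# Simple random sampling: every member has an equal chance
srs = pop.sample(n=10, random_state=0)

# Stratified sampling: draw 10% from each stratum
strat = pop.groupby("stratum", group_keys=False).apply(lambda g: g.sample(frac=0.1, random_state=0))

# Systematic sampling: every k-th item starting from a random offset
k = 10
systematic = pop.iloc[rng.integers(k)::k]

# Cluster sampling: partition into 10 clusters, keep 2 whole clusters at random
pop["cluster"] = pop["id"] // 10
kept = rng.choice(pop["cluster"].unique(), size=2, replace=False)
cluster = pop[pop["cluster"].isin(kept)]
```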


📈 Descriptive Statistics#

Descriptive statistics summarize data using:

  • Measures of Central Tendency: Mean, Median, Mode

  • Measures of Dispersion: Range, Variance, Standard Deviation

  • Shape: Skewness, Kurtosis

  • Position: Percentiles, Quartiles (see the sketch below)
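
A short sketch computing one measure from each category on synthetic data:

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(42).normal(50, 10, 200)  # synthetic sample

print("Central tendency:", np.mean(x), np.median(x))
print("Dispersion:", np.ptp(x), np.var(x, ddof=1), np.std(x, ddof=1))  # range, variance, std
print("Shape:", stats.skew(x), stats.kurtosis(x))                      # skewness, excess kurtosis
print("Position:", np.percentile(x, [25, 50, 75]))                     # quartiles
```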


📊 Comparison of Mean Types#

| Type of Mean | Formula | Use Case |
|---|---|---|
| Arithmetic Mean | \( \bar{x} = \frac{1}{n} \sum x_i \) | General average |
| Geometric Mean | \( \left( \prod x_i \right)^{1/n} \) | Growth rates, ratios |
| Harmonic Mean | \( \frac{n}{\sum \frac{1}{x_i}} \) | Rates, speeds |
| Median | Middle value (sorted data) | Skewed distributions |
| Mode | Most frequent value | Categorical or discrete data |
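
As a quick sketch, `scipy.stats` provides the geometric and harmonic means directly (the values below are hypothetical growth factors):

```python
import numpy as np
from scipy import stats

x = np.array([1.2, 1.5, 0.9, 2.0, 1.1])  # hypothetical annual growth factors

print("Arithmetic:", np.mean(x))
print("Geometric:", stats.gmean(x))   # preferred for compounding growth
print("Harmonic:", stats.hmean(x))    # preferred for averaging rates
print("Median:", np.median(x))
```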


✅ Summary#

  • Sampling allows inference from limited data

  • Descriptive statistics summarize key features

  • Choosing the right average depends on context

  • Sample statistics use \(n-1\) in variance to reduce bias

  • Population statistics assume full data and divide by \(N\)

# 📊 Interactive Statistics Module (Unweighted, Row-wise Display)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider, IntSlider

# Synthetic dataset generator
def generate_data(n=30, mean=50, std=10):
    np.random.seed(42)
    data = np.random.normal(loc=mean, scale=std, size=n)
    return np.round(data, 4)  # round to 4 decimal places

# Statistics calculator
def compute_statistics(n=30, mean=50, std=10):
    x = generate_data(n, mean, std)

    # Sample statistics (ddof=1)
    mean_sample = np.mean(x)
    var_sample = np.var(x, ddof=1)
    std_sample = np.std(x, ddof=1)

    # Population statistics (ddof=0)
    var_pop = np.var(x, ddof=0)
    std_pop = np.std(x, ddof=0)

    # Display statistics
    print("๐Ÿ“˜ Sample Statistics:")
    print(f"Mean: {mean_sample:.4f}")
    print(f"Variance (Sample): {var_sample:.4f}")
    print(f"Standard Deviation (Sample): {std_sample:.4f}\n")

    print("๐Ÿ“— Population Statistics:")
    print(f"Variance (Population): {var_pop:.4f}")
    print(f"Standard Deviation (Population): {std_pop:.4f}\n")

    print("๐Ÿ“‹ Sample Data (25 values per row):")
    for i in range(0, len(x), 25):
        row = x[i:i+25]
        formatted = "  ".join(f"{val:.2f}" for val in row)
        print(formatted)


    # Plot distribution
    plt.figure(figsize=(10, 5))
    plt.hist(x, bins=10, alpha=0.7, color='skyblue', edgecolor='black')
    plt.axvline(mean_sample, color='blue', linestyle='--', label='Sample Mean')
    plt.title("Distribution of Synthetic Sample Data")
    plt.xlabel("Value")
    plt.ylabel("Frequency")
    plt.legend()
    plt.grid(True)
    plt.show()

# Interactive widget
interact(compute_statistics,
         n=IntSlider(min=10, max=100, step=10, value=30, description='Sample Size'),
         mean=FloatSlider(min=30, max=70, step=1, value=50, description='Mean'),
         std=FloatSlider(min=5, max=20, step=1, value=10, description='Std Dev'));

โš–๏ธ Types of Weighted Means: Concepts, Applications, and Limitations#

Weighted means are used when data points contribute unequally to an average. This module explores different types of weighted means, their formulas, use cases, and potential drawbacks.


๐Ÿ“ What Is a Weighted Mean?#

A weighted mean adjusts each valueโ€™s contribution based on its assigned weight:

\[ \bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} \]

Where:

  • \(x_i\) are the data values

  • \(w_i\) are the corresponding weights
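
A minimal sketch of the formula using `np.average`, with hypothetical course scores weighted by credit hours:

```python
import numpy as np

scores = np.array([88, 92, 75])   # hypothetical course scores
credits = np.array([3, 4, 2])     # weights w_i = credit hours

weighted = np.average(scores, weights=credits)  # sum(w*x) / sum(w)
print(weighted)  # (88*3 + 92*4 + 75*2) / 9 ≈ 86.89
```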


🧮 Types of Weighted Means#

| Type | Formula | Application Example | Limitations |
|---|---|---|---|
| Arithmetic Weighted Mean | \( \bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} \) | GPA, average cost, survey scores | Sensitive to outliers and large weights |
| Geometric Weighted Mean | \( \bar{x}_g = \prod x_i^{w_i / \sum w_i} \) | Growth rates, portfolio returns | Cannot handle zero or negative values |
| Harmonic Weighted Mean | \( \bar{x}_h = \frac{\sum w_i}{\sum \frac{w_i}{x_i}} \) | Average speed, rates, efficiency | Undefined for \(x_i = 0\); sensitive to small values |
| Quadratic Mean (RMS) | \( \bar{x}_{rms} = \sqrt{ \frac{\sum w_i x_i^2}{\sum w_i} } \) | Signal processing, error analysis | Overemphasizes large values |
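
A sketch computing all four weighted means on the same hypothetical values, directly from the formulas above:

```python
import numpy as np

x = np.array([10.0, 20.0, 40.0])
w = np.array([0.5, 0.3, 0.2])  # hypothetical weights summing to 1

arithmetic = np.sum(w * x) / np.sum(w)
geometric = np.prod(x ** (w / np.sum(w)))   # requires all x > 0
harmonic = np.sum(w) / np.sum(w / x)        # undefined if any x == 0
rms = np.sqrt(np.sum(w * x**2) / np.sum(w))

print(f"arithmetic={arithmetic:.2f}, geometric={geometric:.2f}, "
      f"harmonic={harmonic:.2f}, rms={rms:.2f}")
# Note the ordering: harmonic ≤ geometric ≤ arithmetic ≤ rms
```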


๐Ÿ” Detailed Descriptions#

1. Arithmetic Weighted Mean#

  • Most common form

  • Each value contributes proportionally to its weight

  • Used in education (GPA), economics (weighted price index), and polling

2. Geometric Weighted Mean#

  • Multiplies values raised to weight fractions

  • Ideal for compounding scenarios (e.g., investment returns)

  • Cannot be used with zero or negative values

3. Harmonic Weighted Mean#

  • Inverse of weighted average of reciprocals

  • Best for averaging rates (e.g., speed over varying distances)

  • Sensitive to very small values and undefined for zero

4. Quadratic Mean (Root Mean Square)#

  • Squares values before averaging

  • Common in physics and engineering (e.g., RMS voltage)

  • Amplifies large deviations


โš ๏ธ Summary of Limitations#

  • Arithmetic: Skewed by extreme weights or outliers

  • Geometric: Requires all values \(> 0\)

  • Harmonic: Undefined for zero, unstable for small values

  • Quadratic: Not robust to outliers, emphasizes large values


✅ Choosing the Right Mean#

| Scenario | Recommended Mean |
|---|---|
| Weighted average of scores | Arithmetic Weighted Mean |
| Compounded growth | Geometric Weighted Mean |
| Averaging rates or speeds | Harmonic Weighted Mean |
| Signal or error magnitude | Quadratic Mean (RMS) |


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider

# Generate synthetic values
def generate_values(n=30, val_min=10, val_max=100):
    np.random.seed(42)
    return np.round(np.random.uniform(val_min, val_max, size=n), 4)

# Normalize weights
def normalize(weights):
    return weights / np.sum(weights)

# Weighting schemes
def unweighted_weights(n):
    return np.ones(n) / n

def random_weights(n, weight_min=1, weight_max=5):
    raw = np.random.randint(weight_min, weight_max + 1, size=n)
    return normalize(raw)

def center_weights(values, val_min, val_max):
    midpoint = (val_min + val_max) / 2
    distances = np.abs(values - midpoint)
    raw = 1 / (distances + 1e-3)
    return normalize(raw)

# Comparison demo
def compare_all_weights(n=30, val_min=10, val_max=100):
    values = generate_values(n, val_min, val_max)
    w_unweighted = unweighted_weights(n)
    w_random = random_weights(n)
    w_center = center_weights(values, val_min, val_max)

    mean_unweighted = np.average(values, weights=w_unweighted)
    mean_random = np.average(values, weights=w_random)
    mean_center = np.average(values, weights=w_center)

    # Display table
    df = pd.DataFrame({
        'Value': values,
        'Unweighted': np.round(w_unweighted, 4),
        'Random Weight': np.round(w_random, 4),
        'Center Weight': np.round(w_center, 4)
    })
    print(df)

    print(f"\nโš–๏ธ Mean Comparisons:")
    print(f"๐Ÿ”น Unweighted Mean: {mean_unweighted:.4f}")
    print(f"๐Ÿ”น Random Weighted Mean: {mean_random:.4f}")
    print(f"๐Ÿ”น Center-Weighted Mean: {mean_center:.4f}")

    # Plot
    plt.figure(figsize=(12, 6))
    plt.scatter(values, np.zeros_like(values), s=w_unweighted * 1000, c='gray', alpha=0.4, label='Unweighted')
    plt.scatter(values, np.ones_like(values), s=w_random * 1000, c='blue', alpha=0.5, label='Random Weights')
    plt.scatter(values, np.ones_like(values) * 2, s=w_center * 1000, c='orange', alpha=0.5, label='Center Weights')
    plt.axvline(mean_unweighted, color='gray', linestyle='--', label='Unweighted Mean')
    plt.axvline(mean_random, color='blue', linestyle='--', label='Random Mean')
    plt.axvline(mean_center, color='orange', linestyle='--', label='Center Mean')
    plt.yticks([0, 1, 2], ['Unweighted', 'Random', 'Center'])
    plt.title("Comparison of Weighting Schemes (Normalized)")
    plt.xlabel("Value")
    plt.legend()
    plt.grid(True)
    plt.show()

# Interactive widget
interact(compare_all_weights,
         n=IntSlider(min=10, max=100, step=10, value=30, description='Sample Size'),
         val_min=IntSlider(min=0, max=50, step=5, value=10, description='Min Value'),
         val_max=IntSlider(min=50, max=150, step=10, value=100, description='Max Value'));

๐Ÿ“ Goodness-of-Fit Measures: Overview and Interpretation#

Goodness-of-fit metrics quantify how well a model or simulation replicates observed data. These measures help evaluate prediction accuracy, error magnitude, and overall model reliability.


🔢 Common Goodness-of-Fit Metrics#

| Measure | Formula | Interpretation |
|---|---|---|
| RMSE (Root Mean Square Error) | \( \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (O_i - P_i)^2 } \) | Penalizes large errors; lower is better |
| MAE (Mean Absolute Error) | \( \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert O_i - P_i \rvert \) | Average error magnitude; less sensitive to outliers than RMSE |
| NSE (Nash–Sutcliffe Efficiency) | \( \text{NSE} = 1 - \frac{ \sum (O_i - P_i)^2 }{ \sum (O_i - \bar{O})^2 } \) | Compares model to mean of observed; 1 is perfect, 0 is mean model |
| R² (Coefficient of Determination) | \( R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} \) | Proportion of variance explained; closer to 1 is better |
| Log-Normalized RMSE | \( \text{Log-RMSE} = \sqrt{ \frac{1}{n} \sum (\log O_i - \log P_i)^2 } \) | Useful when data spans orders of magnitude; reduces skew from large values |
| Bias (Mean Error) | \( \text{Bias} = \frac{1}{n} \sum (P_i - O_i) \) | Indicates systematic over- or under-prediction |
| Percent Bias (PBIAS) | \( \text{PBIAS} = 100 \times \frac{ \sum (O_i - P_i) }{ \sum O_i } \) | Expresses bias as percentage; ideal value is 0 |


📘 Interpretation Guidelines#

| Metric | Ideal Value | Notes |
|---|---|---|
| RMSE, MAE | → 0 | Lower values indicate better fit |
| NSE, R² | → 1 | Higher values indicate better predictive power |
| Bias, PBIAS | → 0 | Bias > 0 = overprediction; note the PBIAS formula above reverses the sign, so PBIAS > 0 = underprediction |
| Log-RMSE | Contextual | Useful for log-transformed or multiplicative error models |


🧪 Use Cases#

| Scenario | Recommended Metric(s) |
|---|---|
| General error magnitude | RMSE, MAE |
| Hydrology / environmental models | NSE, PBIAS |
| Log-transformed or skewed data | Log-RMSE |
| Model bias detection | Bias, Percent Bias |
| Regression performance | R², RMSE |


โš ๏ธ Limitations#

  • RMSE is sensitive to large errors (outliers).

  • NSE can be misleading if observed variance is low.

  • Log-RMSE requires strictly positive values.

  • R² does not indicate bias or error magnitude.

import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider, IntSlider, Output

# Output widget
out = Output()

# Standard growth model (e.g., logistic-like)
def generate_observed_growth(n, r=0.2, K=100, P0=10):
    t = np.arange(n)
    observed = K / (1 + ((K - P0) / P0) * np.exp(-r * t))
    return t, observed

# Simulated growth model (user-controlled)
def generate_simulated_growth(n, r_sim=0.25, K_sim=95, P0_sim=12):
    t = np.arange(n)
    simulated = K_sim / (1 + ((K_sim - P0_sim) / P0_sim) * np.exp(-r_sim * t))
    return simulated

# Goodness-of-fit metrics
def compute_metrics(observed, predicted):
    eps = 1e-6
    mean_obs = np.mean(observed)
    residuals = observed - predicted

    rmse = np.sqrt(np.mean(residuals**2))
    mae = np.mean(np.abs(residuals))
    bias = np.mean(predicted - observed)
    pbias = 100 * np.sum(observed - predicted) / np.sum(observed)
    nse = 1 - np.sum(residuals**2) / np.sum((observed - mean_obs)**2)
    # With this definition, R² about the observed mean is identical to NSE
    r2 = nse
    log_rmse = np.sqrt(np.mean((np.log(observed + eps) - np.log(predicted + eps))**2))

    return {
        "RMSE": rmse,
        "MAE": mae,
        "Bias": bias,
        "PBIAS (%)": pbias,
        "NSE": nse,
        "Rยฒ": r2,
        "Log-RMSE": log_rmse
    }

# Interactive comparison
def compare_growth_models(n=30, r_sim=0.25, K_sim=95, P0_sim=12):
    t, observed = generate_observed_growth(n)
    simulated = generate_simulated_growth(n, r_sim, K_sim, P0_sim)
    metrics = compute_metrics(observed, simulated)

    out.clear_output()
    with out:
        print("๐Ÿ“˜ Goodness-of-Fit Metrics:")
        for k, v in metrics.items():
            print(f"{k}: {v:.4f}")

        print("\n๐Ÿ“— Interpretation:")
        print("โœ… Lower RMSE/MAE โ†’ better fit")
        print("โœ… NSE and Rยฒ close to 1 โ†’ strong predictive power")
        print("โœ… Bias and PBIAS near 0 โ†’ minimal systematic error")
        print("โœ… Log-RMSE useful for skewed or multiplicative data")

        # Plot comparison
        plt.figure(figsize=(10, 5))
        plt.plot(t, observed, label="Observed Growth", marker='o')
        plt.plot(t, simulated, label="Simulated Growth", marker='x')
        plt.title("Observed vs. Simulated Growth Model")
        plt.xlabel("Time")
        plt.ylabel("Population")
        plt.legend()
        plt.grid(True)
        plt.show()

# Interactive controls
interact(
    compare_growth_models,
    n=IntSlider(min=10, max=100, step=5, value=30, description="Time Steps"),
    r_sim=FloatSlider(min=0.1, max=0.5, step=0.01, value=0.25, description="Growth Rate"),
    K_sim=FloatSlider(min=50, max=150, step=5, value=95, description="Carrying Capacity"),
    P0_sim=FloatSlider(min=5, max=30, step=1, value=12, description="Initial Population")
)

out

🎲 Understanding Probability and Its Role in Risk Assessment#

📘 What Is Probability?#

Probability measures how likely an event is to occur, on a scale from 0 to 1:

  • 0 means the event is impossible.

  • 1 means the event is certain.

  • Values between 0 and 1 reflect varying degrees of uncertainty.

  • Probability is foundational to:

    • Statistics

    • Decision-making

    • Machine learning

    • Engineering risk analysis


๐Ÿ” Why Is Probability Important?#

| Domain | Role of Probability |
|---|---|
| Statistics | Models randomness in sampling, inference, and prediction. |
| Engineering | Assesses reliability, failure rates, and safety margins. |
| Computer Science | Powers algorithms in AI, cryptography, and simulations. |
| Finance | Quantifies market risk, portfolio volatility, and insurance models. |
| Environmental Science | Forecasts extreme events like floods, droughts, and pollutant dispersion. |


โš–๏ธ Probability in Risk Assessment#

Risk assessment involves estimating the likelihood and impact of adverse outcomes. Probability helps:

  • Model uncertainty in system behavior (e.g., soil failure, structural collapse).

  • Quantify exposure to hazards (e.g., earthquakes, contamination).

  • Support decision-making under uncertainty (e.g., cost-benefit analysis, safety thresholds).


🧪 Example: Geotechnical Risk#

In slope stability analysis:

  • Probability of failure is estimated using soil strength variability.

  • Monte Carlo simulations or probabilistic limit equilibrium methods are used.

  • Risk = Probability × Consequence (a toy Monte Carlo sketch follows)
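
A toy Monte Carlo sketch of this idea; all numbers (strength distribution, fixed demand, consequence cost) are assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

strength = rng.normal(loc=120.0, scale=20.0, size=n)  # assumed soil strength, kPa
demand = 100.0                                        # assumed fixed driving stress, kPa

fos = strength / demand                 # factor of safety per realization
p_failure = np.mean(fos < 1.0)          # P(FoS < 1)
consequence = 2e6                       # assumed loss per failure, $
print(f"P(failure) ≈ {p_failure:.4f}, risk ≈ ${p_failure * consequence:,.0f}")
```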


📊 Key Probability Concepts#

| Concept | Description |
|---|---|
| Experiment | A repeatable process with uncertain outcome (e.g., rolling a die). |
| Sample Space | The set of all possible outcomes. |
| Event | A subset of outcomes (e.g., rolling an even number). |
| Probability Rule | \(P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total outcomes}}\) |
| Complement Rule | \(P(\text{not } A) = 1 - P(A)\) |
| Addition Rule | \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\) |
| Multiplication Rule | \(P(A \cap B) = P(A) \cdot P(B \mid A)\) |
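
A short simulation verifying the complement and addition rules empirically with die rolls:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die

p_even = np.mean(rolls % 2 == 0)                   # A: even number
p_high = np.mean(rolls >= 5)                       # B: rolling 5 or 6
p_both = np.mean((rolls % 2 == 0) & (rolls >= 5))  # A and B: only 6

# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
print(np.mean((rolls % 2 == 0) | (rolls >= 5)), p_even + p_high - p_both)
# Complement rule: P(not A) = 1 - P(A)
print(np.mean(rolls % 2 == 1), 1 - p_even)
```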


🎯 Interactive Simulation: Empirical Probability and Return Period#

This simulation demonstrates how to estimate the empirical probability of an event exceeding a threshold and compute its return period using synthetic data.

๐Ÿ” What the Code Does#

  • Generates synthetic event magnitudes using a uniform distribution between val_min and val_max.

  • Allows the user to select a threshold value.

  • Calculates:

    • The number of events that exceed the threshold.

    • The empirical probability of exceedance:
      \(P(\text{event}) = \frac{\text{Number of exceedances}}{\text{Total events}}\)

    • The return period (or recurrence interval):
      \(\text{Return Period} = \frac{1}{P(\text{event})}\)

📈 Visualization#

  • A histogram shows the distribution of event magnitudes.

  • A red dashed line marks the selected threshold.

  • Summary statistics are printed to reinforce interpretation.

🧪 Applications#

This simulation is useful for exploring:

  • Flood frequency analysis

  • Earthquake magnitude exceedance

  • Slope failure probability

  • Any domain where event magnitude vs. threshold matters

โš™๏ธ Interactive Controls#

  • Sample Size: Number of synthetic events generated.

  • Min Value and Max Value: Range of event magnitudes.

  • Threshold: Value used to define a “critical” or “extreme” event.

📊 Example Use Case#

If you set:

  • val_min = 0, val_max = 100

  • threshold = 80

  • n = 100

The simulation will estimate how often events exceed 80 and how frequently such events are expected to recur.


The simulation below extends this idea to a synthetic time series with multiple thresholds for comparative risk analysis.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider, FloatSlider

# Generate synthetic time-series event magnitudes
def generate_time_series(n=100, val_min=0, val_max=100):
    np.random.seed(42)
    magnitudes = np.round(np.random.uniform(val_min, val_max, size=n), 2)
    time = np.arange(1, n + 1)
    return pd.DataFrame({'Time': time, 'Magnitude': magnitudes})

# Simulation with multiple thresholds
def multi_threshold_demo(n=100, val_min=0, val_max=100, t1=60, t2=80, t3=100):
    df = generate_time_series(n, val_min, val_max)
    thresholds = [t1, t2, t3]
    results = []

    for t in thresholds:
        exceed = df['Magnitude'] > t
        prob = exceed.sum() / n
        rp = 1 / prob if prob > 0 else np.inf
        results.append((t, prob, rp))
        df[f'> {t}'] = exceed

    # Display results
    print("๐Ÿ“Š Empirical Probability and Return Periods:")
    for t, p, r in results:
        print(f"Threshold {t:>3}:  Probability = {p:.4f},  Return Period = {r:.2f} events")

    # Plot time series
    plt.figure(figsize=(12, 5))
    plt.plot(df['Time'], df['Magnitude'], marker='o', linestyle='-', color='gray', label='Event Magnitude')
    for t in thresholds:
        plt.axhline(t, linestyle='--', label=f'Threshold {t}')
    plt.title("Synthetic Time-Series of Event Magnitudes")
    plt.xlabel("Time")
    plt.ylabel("Magnitude")
    plt.legend()
    plt.grid(True)
    plt.show()

# Interactive widget
interact(multi_threshold_demo,
         n=IntSlider(min=50, max=500, step=50, value=100, description='Sample Size'),
         val_min=IntSlider(min=0, max=50, step=5, value=0, description='Min Value'),
         val_max=IntSlider(min=50, max=150, step=10, value=100, description='Max Value'),
         t1=FloatSlider(min=0, max=150, step=1, value=60, description='Threshold 1'),
         t2=FloatSlider(min=0, max=150, step=1, value=80, description='Threshold 2'),
         t3=FloatSlider(min=0, max=150, step=1, value=100, description='Threshold 3'));

📊 Using Theoretical Distributions in Risk and Probability Analysis#

When empirical data is scarce, unreliable, or incomplete, we turn to theoretical probability distributions to model uncertainty and estimate risk. These distributions are based on mathematical assumptions and are widely used in engineering, environmental science, and decision analysis.


📘 Why Use Theoretical Distributions?#

  • Empirical distributions rely on observed data, which may be:

    • Too sparse (e.g., rare events like major earthquakes)

    • Incomplete (e.g., short monitoring periods)

    • Biased or noisy

  • Theoretical distributions provide:

    • A structured way to model uncertainty

    • Flexibility to simulate rare or extreme events

    • A foundation for probabilistic design and risk assessment


๐Ÿ“ Common Theoretical Distributions#

| Distribution | Typical Use Case | Shape Characteristics |
|---|---|---|
| Normal | Measurement errors, material properties | Symmetric, bell-shaped |
| Lognormal | Soil permeability, rainfall intensity | Skewed right, non-negative |
| Exponential | Time between failures, hazard occurrence | Memoryless, decreasing tail |
| Weibull | Reliability, fatigue life | Flexible shape for increasing/decreasing hazard |
| Gumbel | Extreme value modeling (e.g., floods, wind) | Models maxima/minima of distributions |
| Binomial | Discrete event counts (e.g., success/failure) | Finite trials, fixed probability |
| Poisson | Rare event frequency (e.g., landslides/year) | Count-based, low probability |


โš–๏ธ Application in Risk Assessment#

Theoretical distributions help estimate:

  • Probability of exceedance:
    \(P(X > x)\) for a critical threshold \(x\)

  • Return period:
    \(\text{Return Period} = \frac{1}{P(\text{event})}\)

  • Confidence intervals for uncertain parameters

  • Failure probabilities in reliability models


🧪 Example: Flood Risk Using Gumbel Distribution#

If historical flood data is limited, we can fit a Gumbel distribution to estimate the following (a sketch with synthetic data appears after the list):

  • Probability of a flood exceeding 500 m³/s

  • Expected return period of such an event

  • Design thresholds for levees or drainage systems
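
A hedged sketch of that workflow using `scipy.stats.gumbel_r` on synthetic annual peaks; a real analysis would substitute observed flow data:

```python
import numpy as np
from scipy import stats

# Synthetic 40-year record of annual peak flows (m³/s); parameters are assumed
peaks = stats.gumbel_r.rvs(loc=300, scale=80, size=40, random_state=7)

loc, scale = stats.gumbel_r.fit(peaks)          # fit Gumbel to the record
p_exceed = stats.gumbel_r.sf(500, loc, scale)   # P(X > 500 m³/s)
print(f"P(X > 500) ≈ {p_exceed:.4f}, return period ≈ {1 / p_exceed:.1f} years")
```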


🧠 Key Takeaways#

  • Theoretical distributions are essential when empirical data is insufficient.

  • Choice of distribution depends on the nature of the variable (continuous vs. discrete, symmetric vs. skewed).

  • They enable simulation, forecasting, and probabilistic design under uncertainty.


Distribution Fitting and Goodness of Fit#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from IPython.display import display  # renders the summary DataFrame below
from ipywidgets import interact, SelectMultiple, FloatSlider, IntSlider, Button, VBox, HBox, Output

# Output widget for dynamic display
out = Output()

# Available distributions
available_distributions = {
    "Normal": stats.norm,
    "Lognormal": stats.lognorm,
    "Exponential": stats.expon,
    "Weibull": stats.weibull_min
}

# Global data container
data = None

# Simulation function
def simulate_data(n=500, shape=2.0, scale=10.0):
    global data
    np.random.seed(42)
    data = np.random.gamma(shape=shape, scale=scale, size=n)

# Fit and analyze selected distributions
def analyze(selected, threshold):
    out.clear_output()
    with out:
        if data is None:
            print("โš ๏ธ Please click 'Simulate' first.")
            return

        x = np.linspace(min(data), max(data), 100)
        fit_results = []

        # Empirical exceedance
        exceed_emp = data > threshold
        prob_emp = exceed_emp.sum() / len(data)
        rp_emp = 1 / prob_emp if prob_emp > 0 else np.inf

        print(f"๐Ÿ“Š Empirical Probability of Exceeding {threshold:.2f}: {prob_emp:.4f}")
        print(f"๐Ÿ” Empirical Return Period: {rp_emp:.2f} events\n")

        for name in selected:
            dist = available_distributions[name]
            params = dist.fit(data)
            pdf_fitted = dist.pdf(x, *params)
            ks_stat, ks_p = stats.kstest(data, dist.name, args=params)
            log_likelihood = np.sum(dist.logpdf(data, *params))
            k = len(params)
            aic = 2 * k - 2 * log_likelihood

            # Empirical 2.5th–97.5th percentile range of the data
            # (a rough interval, not a bootstrap CI on the fitted parameters)
            ci_lower = np.percentile(data, 2.5)
            ci_upper = np.percentile(data, 97.5)

            # Theoretical exceedance
            prob_theo = dist.sf(threshold, *params)
            rp_theo = 1 / prob_theo if prob_theo > 0 else np.inf

            fit_results.append({
                "Distribution": name,
                "KS Statistic": round(ks_stat, 4),
                "KS p-value": round(ks_p, 4),
                "AIC": round(aic, 2),
                "95% CI": f"[{ci_lower:.2f}, {ci_upper:.2f}]",
                "P(X > threshold)": round(prob_theo, 4),
                "Return Period": round(rp_theo, 2)
            })

            # Plot
            plt.figure(figsize=(6, 4))
            plt.hist(data, bins=30, density=True, alpha=0.5, label="Empirical")
            plt.plot(x, pdf_fitted, label=f"{name} Fit", linewidth=2)
            plt.axvline(threshold, color='red', linestyle='--', label=f"Threshold = {threshold}")
            plt.title(f"{name} Distribution Fit")
            plt.xlabel("Value")
            plt.ylabel("Density")
            plt.legend()
            plt.grid(True)
            plt.show()

        # Summary table
        df_results = pd.DataFrame(fit_results)
        print("๐Ÿ“Š Goodness-of-Fit and Risk Summary:")
        display(df_results.sort_values("AIC"))

# Widgets
simulate_btn = Button(description="Simulate", button_style='success')
simulate_btn.on_click(lambda b: simulate_data())

distribution_selector = SelectMultiple(
    options=list(available_distributions.keys()),
    value=["Normal", "Lognormal"],
    description="Distributions"
)

threshold_slider = FloatSlider(
    value=80,
    min=0,
    max=200,
    step=1,
    description="Threshold"
)

analyze_btn = Button(description="Analyze", button_style='info')
analyze_btn.on_click(lambda b: analyze(distribution_selector.value, threshold_slider.value))

# Layout
VBox([
    HBox([simulate_btn, analyze_btn]),
    distribution_selector,
    threshold_slider,
    out
])

🧪 Hypothesis Testing: Concepts, Types, and Applications#

Hypothesis testing is a statistical method used to make decisions or inferences about population parameters based on sample data.


🎯 Core Concepts#

  • Null Hypothesis (\(H_0\)): Assumes no effect or no difference.

  • Alternative Hypothesis (\(H_1\) or \(H_a\)): Assumes a real effect or difference exists.

  • Test Statistic: A value computed from sample data used to decide whether to reject \(H_0\).

  • p-value: Probability, under \(H_0\), of observing a test statistic at least as extreme as the one computed.

  • Significance Level (\(\alpha\)): Threshold for rejecting \(H_0\), commonly 0.05.

  • Confidence Interval: Range of values likely to contain the true parameter.


๐Ÿ” Types of Hypothesis Tests#

| Test Type | Use Case | Assumptions | Example Scenario |
|---|---|---|---|
| Z-test | Compare means (large \(n\), known \(\sigma\)) | Normal distribution, known variance | Mean weight of manufactured parts |
| t-test | Compare means (small \(n\), unknown \(\sigma\)) | Normality, equal variance (for 2-sample) | Drug effectiveness between two groups |
| Chi-square test | Test for independence or goodness-of-fit | Categorical data, expected counts ≥ 5 | Survey response vs. age group |
| ANOVA | Compare means across ≥ 3 groups | Normality, equal variance | Crop yield across different fertilizers |
| Mann–Whitney U | Non-parametric test for two independent groups | No assumption of normality | Median income between regions |
| Wilcoxon Signed-Rank | Non-parametric test for paired samples | Symmetric distribution of differences | Before/after treatment scores |
| Kolmogorov–Smirnov | Compare distributions or test normality | Continuous data | Fit of theoretical distribution |
| Log-rank test | Compare survival curves | Censored data, proportional hazards | Time-to-failure in reliability studies |


📊 Decision Process#

  1. Formulate hypotheses: \(H_0\) and \(H_1\)

  2. Choose significance level: \(\alpha = 0.05\)

  3. Select appropriate test based on data type and assumptions

  4. Compute test statistic and p-value

  5. Compare p-value to \(\alpha\):

    • If \(p < \alpha\): Reject \(H_0\)

    • If \(p \geq \alpha\): Fail to reject \(H_0\) (a minimal worked example follows)
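
A minimal one-sample example of the full decision process, on hypothetical data with \(H_0: \mu = 50\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(52, 10, 25)  # hypothetical measurements

alpha = 0.05                                             # step 2: significance level
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)  # steps 3-4: test statistic and p-value
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("Reject H₀" if p_value < alpha else "Fail to reject H₀")  # step 5: decision
```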


🧠 Applications in Engineering and Science#

  • Quality control: Z-test, t-test for process monitoring

  • Environmental analysis: ANOVA for comparing pollutant levels

  • Reliability: Log-rank test for component lifetimes

  • Geotechnical studies: Non-parametric tests for soil properties

  • Remote sensing: Chi-square for classification accuracy


The interactive example below simulates two groups and runs a two-sample t-test.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind
from ipywidgets import interact, IntSlider, FloatSlider, Button, Output, VBox, HBox

# Output widget
out = Output()

# Global data
group1 = None
group2 = None

# Simulate data
def simulate_ttest(n=30, mu1=50, mu2=55, sigma=10):
    global group1, group2
    np.random.seed(42)
    group1 = np.random.normal(loc=mu1, scale=sigma, size=n)
    group2 = np.random.normal(loc=mu2, scale=sigma, size=n)

# Run t-test and plot
def run_ttest():
    out.clear_output()
    with out:
        if group1 is None or group2 is None:
            print("โš ๏ธ Please simulate data first.")
            return

        # Perform t-test
        stat, p = ttest_ind(group1, group2)
        mean1, mean2 = np.mean(group1), np.mean(group2)

        # Plot
        plt.figure(figsize=(6, 4))
        plt.hist(group1, bins=20, alpha=0.6, label="Group 1", color='blue')
        plt.hist(group2, bins=20, alpha=0.6, label="Group 2", color='orange')
        plt.axvline(mean1, color='blue', linestyle='--', label=f"Mean 1 = {mean1:.2f}")
        plt.axvline(mean2, color='orange', linestyle='--', label=f"Mean 2 = {mean2:.2f}")
        plt.title("Distribution of Two Groups")
        plt.xlabel("Value")
        plt.ylabel("Frequency")
        plt.legend()
        plt.grid(True)
        plt.show()

        # Summary
        print("๐Ÿ“Š Two-Sample t-Test Results:")
        print(f"t-statistic: {stat:.4f}")
        print(f"p-value: {p:.4f}")
        if p < 0.05:
            print("โœ… Significant difference (reject Hโ‚€)")
        else:
            print("โŒ No significant difference (fail to reject Hโ‚€)")

# Widgets
simulate_btn = Button(description="Simulate", button_style='success')
simulate_btn.on_click(lambda b: simulate_ttest(n_slider.value, mu1_slider.value, mu2_slider.value))

run_btn = Button(description="Run t-Test", button_style='info')
run_btn.on_click(lambda b: run_ttest())

n_slider = IntSlider(value=30, min=10, max=100, step=5, description="Sample Size")
mu1_slider = FloatSlider(value=50, min=30, max=70, step=1, description="Mean Group 1")
mu2_slider = FloatSlider(value=55, min=30, max=70, step=1, description="Mean Group 2")

# Layout
VBox([
    HBox([simulate_btn, run_btn]),
    n_slider,
    mu1_slider,
    mu2_slider,
    out
])

Hypothesis Testing#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from ipywidgets import Dropdown, IntSlider, FloatSlider, Button, Output, VBox, HBox

# Output widget
out = Output()

# Global data containers
data_store = {}

# Test options
test_options = [
    "Z-test",
    "t-test",
    "Chi-square test",
    "ANOVA",
    "Mannโ€“Whitney U",
    "Wilcoxon Signed-Rank",
    "Kolmogorovโ€“Smirnov",
    "Log-rank test (simulated)"
]

# Simulate data
def simulate_data(n, mu1, mu2):
    np.random.seed(42)
    data_store.clear()
    data_store["x1"] = np.random.normal(mu1, 10, n)
    data_store["x2"] = np.random.normal(mu2, 10, n)
    data_store["before"] = np.random.normal(mu1, 10, n)
    data_store["after"] = data_store["before"] + np.random.normal(mu2 - mu1, 5, n)
    data_store["anova"] = [
        np.random.normal(50, 10, n),
        np.random.normal(55, 10, n),
        np.random.normal(60, 10, n)
    ]
    data_store["survival1"] = np.random.exponential(scale=10, size=n)
    data_store["survival2"] = np.random.exponential(scale=15, size=n)

# Analyze selected test
def analyze_test(test_type):
    out.clear_output()
    with out:
        if not data_store:
            print("โš ๏ธ Please click 'Simulate' first.")
            return

        def interpret(p, alpha=0.05):
            if p < alpha:
                return f"✅ Result is statistically significant (p < {alpha}). Reject the null hypothesis."
            else:
                return f"❌ Result is not statistically significant (p ≥ {alpha}). Fail to reject the null hypothesis."

        if test_type == "Z-test":
            mu = 10
            x1, x2 = data_store["x1"], data_store["x2"]
            z = (np.mean(x1) - np.mean(x2)) / np.sqrt(mu**2/len(x1) + mu**2/len(x2))
            p = 2 * (1 - stats.norm.cdf(abs(z)))
            print(f"Z-test: z = {z:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "t-test":
            stat, p = stats.ttest_ind(data_store["x1"], data_store["x2"])
            print(f"t-test: t = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "Chi-square test":
            observed = np.array([[20, 30], [25, 25]])
            stat, p, _, _ = stats.chi2_contingency(observed)
            print(f"Chi-square test: ฯ‡ยฒ = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "ANOVA":
            stat, p = stats.f_oneway(*data_store["anova"])
            print(f"ANOVA: F = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "Mannโ€“Whitney U":
            stat, p = stats.mannwhitneyu(data_store["x1"], data_store["x2"])
            print(f"Mannโ€“Whitney U: U = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "Wilcoxon Signed-Rank":
            stat, p = stats.wilcoxon(data_store["before"], data_store["after"])
            print(f"Wilcoxon Signed-Rank: W = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "Kolmogorovโ€“Smirnov":
            x = data_store["x1"]
            stat, p = stats.kstest(x, 'norm', args=(np.mean(x), np.std(x)))
            print(f"Kolmogorovโ€“Smirnov: D = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "Log-rank test (simulated)":
            stat, p = stats.ttest_ind(data_store["survival1"], data_store["survival2"])
            print(f"Log-rank (simulated): t = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        # Optional plot
        plt.figure(figsize=(6, 4))
        if test_type in ["Z-test", "t-test", "Mann–Whitney U"]:
            plt.hist(data_store["x1"], bins=20, alpha=0.6, label="Group 1")
            plt.hist(data_store["x2"], bins=20, alpha=0.6, label="Group 2")
        elif test_type == "Wilcoxon Signed-Rank":
            plt.plot(data_store["before"], label="Before", marker='o')
            plt.plot(data_store["after"], label="After", marker='x')
        elif test_type == "ANOVA":
            plt.boxplot(data_store["anova"], labels=["Group 1", "Group 2", "Group 3"])
        elif test_type == "Kolmogorovโ€“Smirnov":
            x = data_store["x1"]
            plt.hist(x, bins=20, alpha=0.7, label="Sample")
            plt.plot(np.sort(x), stats.norm.pdf(np.sort(x), np.mean(x), np.std(x)), label="Normal PDF")
        elif test_type == "Log-rank test (simulated)":
            plt.hist(data_store["survival1"], bins=20, alpha=0.6, label="Group 1")
            plt.hist(data_store["survival2"], bins=20, alpha=0.6, label="Group 2")
        plt.title(f"{test_type} Visualization")
        plt.legend()
        plt.grid(True)
        plt.show()


# Widgets
test_selector = Dropdown(options=test_options, value="t-test", description="Test Type")
n_slider = IntSlider(value=30, min=10, max=100, step=5, description="Sample Size")
mu1_slider = FloatSlider(value=50, min=30, max=70, step=1, description="Mean Group 1")
mu2_slider = FloatSlider(value=55, min=30, max=70, step=1, description="Mean Group 2")

simulate_btn = Button(description="Simulate", button_style='success')
simulate_btn.on_click(lambda b: simulate_data(n_slider.value, mu1_slider.value, mu2_slider.value))

analyze_btn = Button(description="Analyze", button_style='info')
analyze_btn.on_click(lambda b: analyze_test(test_selector.value))

# Layout
VBox([
    HBox([simulate_btn, analyze_btn]),
    test_selector,
    n_slider,
    mu1_slider,
    mu2_slider,
    out
])