📊 Sampling and Descriptive Statistics#

Understanding how to summarize and interpret data is foundational in statistics. This module introduces key concepts in sampling, population vs. sample statistics, and types of averages.


🎯 Population vs. Sample Statistics#

| Concept | Population (Parameter) | Sample (Statistic) |
|---|---|---|
| Mean | \( \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \) | \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \) |
| Variance | \( \sigma^2 = \frac{1}{N} \sum (x_i - \mu)^2 \) | \( s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 \) |
| Standard Deviation | \( \sigma \) | \( s \) |
| Proportion | \( P \) | \( \hat{p} \) |

  • Population: Entire group of interest

  • Sample: Subset used to estimate population characteristics (the sketch below shows the computational difference)
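
A minimal NumPy sketch (with hypothetical values) showing how the `ddof` argument switches between the population and sample formulas:

```python
import numpy as np

x = np.array([12.0, 15.0, 9.0, 14.0, 10.0])  # hypothetical measurements

# Population formulas divide by N (ddof=0); sample formulas divide by n-1 (ddof=1)
print(np.var(x, ddof=0))  # population variance
print(np.var(x, ddof=1))  # sample variance (unbiased estimator)
```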


🧪 Sampling Methods#

  • Simple Random Sampling: Every member has equal chance

  • Stratified Sampling: Divide into subgroups and sample proportionally

  • Systematic Sampling: Select every \(k^{th}\) item

  • Cluster Sampling: Randomly select entire groups (all four schemes are sketched below)
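
A minimal pandas/NumPy sketch of the four schemes, using a hypothetical population of 100 units split into two strata:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pop = pd.DataFrame({"id": np.arange(100), "stratum": ["A"] * 60 + ["B"] * 40})

# Simple random sampling: every member has an equal chance
srs = pop.sample(n=10, random_state=0)

# Stratified sampling: draw 10% from each stratum
strat = pop.groupby("stratum", group_keys=False).apply(lambda g: g.sample(frac=0.1, random_state=0))

# Systematic sampling: every k-th item starting from a random offset
k = 10
systematic = pop.iloc[rng.integers(k)::k]

# Cluster sampling: partition into 10 clusters, keep 2 whole clusters at random
pop["cluster"] = pop["id"] // 10
kept = rng.choice(pop["cluster"].unique(), size=2, replace=False)
cluster = pop[pop["cluster"].isin(kept)]
```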


📈 Descriptive Statistics#

Descriptive statistics summarize data using:

  • Measures of Central Tendency: Mean, Median, Mode

  • Measures of Dispersion: Range, Variance, Standard Deviation

  • Shape: Skewness, Kurtosis

  • Position: Percentiles, Quartiles (see the sketch below)
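
A short sketch computing one measure from each category on synthetic data:

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(42).normal(50, 10, 200)  # synthetic sample

print("Central tendency:", np.mean(x), np.median(x))
print("Dispersion:", np.ptp(x), np.var(x, ddof=1), np.std(x, ddof=1))  # range, variance, std
print("Shape:", stats.skew(x), stats.kurtosis(x))                      # skewness, excess kurtosis
print("Position:", np.percentile(x, [25, 50, 75]))                     # quartiles
```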


📊 Comparison of Mean Types#

| Type of Mean | Formula | Use Case |
|---|---|---|
| Arithmetic Mean | \( \bar{x} = \frac{1}{n} \sum x_i \) | General average |
| Geometric Mean | \( \left( \prod x_i \right)^{1/n} \) | Growth rates, ratios |
| Harmonic Mean | \( \frac{n}{\sum \frac{1}{x_i}} \) | Rates, speeds |
| Median | Middle value (sorted data) | Skewed distributions |
| Mode | Most frequent value | Categorical or discrete data |
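
As a quick sketch, `scipy.stats` provides the geometric and harmonic means directly (the values below are hypothetical growth factors):

```python
import numpy as np
from scipy import stats

x = np.array([1.2, 1.5, 0.9, 2.0, 1.1])  # hypothetical annual growth factors

print("Arithmetic:", np.mean(x))
print("Geometric:", stats.gmean(x))   # preferred for compounding growth
print("Harmonic:", stats.hmean(x))    # preferred for averaging rates
print("Median:", np.median(x))
```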


✅ Summary#

  • Sampling allows inference from limited data

  • Descriptive statistics summarize key features

  • Choosing the right average depends on context

  • Sample statistics use \(n-1\) in variance to reduce bias

  • Population statistics assume full data and divide by \(N\)

# 📊 Interactive Statistics Module (Unweighted, Row-wise Display)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider, IntSlider

# Synthetic dataset generator
def generate_data(n=30, mean=50, std=10):
    np.random.seed(42)
    data = np.random.normal(loc=mean, scale=std, size=n)
    return np.round(data, 4)  # round to 4 decimal places

# Statistics calculator
def compute_statistics(n=30, mean=50, std=10):
    x = generate_data(n, mean, std)

    # Sample statistics (ddof=1)
    mean_sample = np.mean(x)
    var_sample = np.var(x, ddof=1)
    std_sample = np.std(x, ddof=1)

    # Population statistics (ddof=0)
    var_pop = np.var(x, ddof=0)
    std_pop = np.std(x, ddof=0)

    # Display statistics
    print("๐Ÿ“˜ Sample Statistics:")
    print(f"Mean: {mean_sample:.4f}")
    print(f"Variance (Sample): {var_sample:.4f}")
    print(f"Standard Deviation (Sample): {std_sample:.4f}\n")

    print("๐Ÿ“— Population Statistics:")
    print(f"Variance (Population): {var_pop:.4f}")
    print(f"Standard Deviation (Population): {std_pop:.4f}\n")

    print("๐Ÿ“‹ Sample Data (25 values per row):")
    for i in range(0, len(x), 25):
        row = x[i:i+25]
        formatted = "  ".join(f"{val:.2f}" for val in row)
        print(formatted)


    # Plot distribution
    plt.figure(figsize=(10, 5))
    plt.hist(x, bins=10, alpha=0.7, color='skyblue', edgecolor='black')
    plt.axvline(mean_sample, color='blue', linestyle='--', label='Sample Mean')
    plt.title("Distribution of Synthetic Sample Data")
    plt.xlabel("Value")
    plt.ylabel("Frequency")
    plt.legend()
    plt.grid(True)
    plt.show()

# Interactive widget
interact(compute_statistics,
         n=IntSlider(min=10, max=100, step=10, value=30, description='Sample Size'),
         mean=FloatSlider(min=30, max=70, step=1, value=50, description='Mean'),
         std=FloatSlider(min=5, max=20, step=1, value=10, description='Std Dev'));

โš–๏ธ Types of Weighted Means: Concepts, Applications, and Limitations#

Weighted means are used when data points contribute unequally to an average. This module explores different types of weighted means, their formulas, use cases, and potential drawbacks.


๐Ÿ“ What Is a Weighted Mean?#

A weighted mean adjusts each valueโ€™s contribution based on its assigned weight:

\[ \bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} \]

Where:

  • \(x_i\) are the data values

  • \(w_i\) are the corresponding weights
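
A minimal sketch of the formula using `np.average`, with hypothetical course scores weighted by credit hours:

```python
import numpy as np

scores = np.array([88, 92, 75])   # hypothetical course scores
credits = np.array([3, 4, 2])     # weights w_i = credit hours

weighted = np.average(scores, weights=credits)  # sum(w*x) / sum(w)
print(weighted)  # (88*3 + 92*4 + 75*2) / 9 ≈ 86.89
```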


🧮 Types of Weighted Means#

| Type | Formula | Application Example | Limitations |
|---|---|---|---|
| Arithmetic Weighted Mean | \( \bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} \) | GPA, average cost, survey scores | Sensitive to outliers and large weights |
| Geometric Weighted Mean | \( \bar{x}_g = \prod x_i^{w_i / \sum w_i} \) | Growth rates, portfolio returns | Cannot handle zero or negative values |
| Harmonic Weighted Mean | \( \bar{x}_h = \frac{\sum w_i}{\sum \frac{w_i}{x_i}} \) | Average speed, rates, efficiency | Undefined for \(x_i = 0\); sensitive to small values |
| Quadratic Mean (RMS) | \( \bar{x}_{rms} = \sqrt{ \frac{\sum w_i x_i^2}{\sum w_i} } \) | Signal processing, error analysis | Overemphasizes large values |
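
A sketch computing all four weighted means on the same hypothetical values, directly from the formulas above:

```python
import numpy as np

x = np.array([10.0, 20.0, 40.0])
w = np.array([0.5, 0.3, 0.2])  # hypothetical weights summing to 1

arithmetic = np.sum(w * x) / np.sum(w)
geometric = np.prod(x ** (w / np.sum(w)))   # requires all x > 0
harmonic = np.sum(w) / np.sum(w / x)        # undefined if any x == 0
rms = np.sqrt(np.sum(w * x**2) / np.sum(w))

print(f"arithmetic={arithmetic:.2f}, geometric={geometric:.2f}, "
      f"harmonic={harmonic:.2f}, rms={rms:.2f}")
# Note the ordering: harmonic ≤ geometric ≤ arithmetic ≤ rms
```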


๐Ÿ” Detailed Descriptions#

1. Arithmetic Weighted Mean#

  • Most common form

  • Each value contributes proportionally to its weight

  • Used in education (GPA), economics (weighted price index), and polling

2. Geometric Weighted Mean#

  • Multiplies values raised to weight fractions

  • Ideal for compounding scenarios (e.g., investment returns)

  • Cannot be used with zero or negative values

3. Harmonic Weighted Mean#

  • Inverse of weighted average of reciprocals

  • Best for averaging rates (e.g., speed over varying distances)

  • Sensitive to very small values and undefined for zero

4. Quadratic Mean (Root Mean Square)#

  • Squares values before averaging

  • Common in physics and engineering (e.g., RMS voltage)

  • Amplifies large deviations


โš ๏ธ Summary of Limitations#

  • Arithmetic: Skewed by extreme weights or outliers

  • Geometric: Requires all values \(> 0\)

  • Harmonic: Undefined for zero, unstable for small values

  • Quadratic: Not robust to outliers, emphasizes large values


✅ Choosing the Right Mean#

| Scenario | Recommended Mean |
|---|---|
| Weighted average of scores | Arithmetic Weighted Mean |
| Compounded growth | Geometric Weighted Mean |
| Averaging rates or speeds | Harmonic Weighted Mean |
| Signal or error magnitude | Quadratic Mean (RMS) |


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider

# Generate synthetic values
def generate_values(n=30, val_min=10, val_max=100):
    np.random.seed(42)
    return np.round(np.random.uniform(val_min, val_max, size=n), 4)

# Normalize weights
def normalize(weights):
    return weights / np.sum(weights)

# Weighting schemes
def unweighted_weights(n):
    return np.ones(n) / n

def random_weights(n, weight_min=1, weight_max=5):
    raw = np.random.randint(weight_min, weight_max + 1, size=n)
    return normalize(raw)

def center_weights(values, val_min, val_max):
    midpoint = (val_min + val_max) / 2
    distances = np.abs(values - midpoint)
    raw = 1 / (distances + 1e-3)
    return normalize(raw)

# Comparison demo
def compare_all_weights(n=30, val_min=10, val_max=100):
    values = generate_values(n, val_min, val_max)
    w_unweighted = unweighted_weights(n)
    w_random = random_weights(n)
    w_center = center_weights(values, val_min, val_max)

    mean_unweighted = np.average(values, weights=w_unweighted)
    mean_random = np.average(values, weights=w_random)
    mean_center = np.average(values, weights=w_center)

    # Display table
    df = pd.DataFrame({
        'Value': values,
        'Unweighted': np.round(w_unweighted, 4),
        'Random Weight': np.round(w_random, 4),
        'Center Weight': np.round(w_center, 4)
    })
    print(df)

    print(f"\nโš–๏ธ Mean Comparisons:")
    print(f"๐Ÿ”น Unweighted Mean: {mean_unweighted:.4f}")
    print(f"๐Ÿ”น Random Weighted Mean: {mean_random:.4f}")
    print(f"๐Ÿ”น Center-Weighted Mean: {mean_center:.4f}")

    # Plot
    plt.figure(figsize=(12, 6))
    plt.scatter(values, np.zeros_like(values), s=w_unweighted * 1000, c='gray', alpha=0.4, label='Unweighted')
    plt.scatter(values, np.ones_like(values), s=w_random * 1000, c='blue', alpha=0.5, label='Random Weights')
    plt.scatter(values, np.ones_like(values) * 2, s=w_center * 1000, c='orange', alpha=0.5, label='Center Weights')
    plt.axvline(mean_unweighted, color='gray', linestyle='--', label='Unweighted Mean')
    plt.axvline(mean_random, color='blue', linestyle='--', label='Random Mean')
    plt.axvline(mean_center, color='orange', linestyle='--', label='Center Mean')
    plt.yticks([0, 1, 2], ['Unweighted', 'Random', 'Center'])
    plt.title("Comparison of Weighting Schemes (Normalized)")
    plt.xlabel("Value")
    plt.legend()
    plt.grid(True)
    plt.show()

# Interactive widget
interact(compare_all_weights,
         n=IntSlider(min=10, max=100, step=10, value=30, description='Sample Size'),
         val_min=IntSlider(min=0, max=50, step=5, value=10, description='Min Value'),
         val_max=IntSlider(min=50, max=150, step=10, value=100, description='Max Value'));

๐Ÿ“ Goodness-of-Fit Measures: Overview and Interpretation#

Goodness-of-fit metrics quantify how well a model or simulation replicates observed data. These measures help evaluate prediction accuracy, error magnitude, and overall model reliability.


🔢 Common Goodness-of-Fit Metrics#

| Measure | Formula | Interpretation |
|---|---|---|
| RMSE (Root Mean Square Error) | \( \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (O_i - P_i)^2 } \) | Penalizes large errors; lower is better |
| MAE (Mean Absolute Error) | \( \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert O_i - P_i \rvert \) | Average error magnitude; less sensitive to outliers than RMSE |
| NSE (Nash–Sutcliffe Efficiency) | \( \text{NSE} = 1 - \frac{ \sum (O_i - P_i)^2 }{ \sum (O_i - \bar{O})^2 } \) | Compares model to mean of observed; 1 is perfect, 0 is mean model |
| R² (Coefficient of Determination) | \( R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} \) | Proportion of variance explained; closer to 1 is better |
| Log-Normalized RMSE | \( \text{Log-RMSE} = \sqrt{ \frac{1}{n} \sum (\log O_i - \log P_i)^2 } \) | Useful when data spans orders of magnitude; reduces skew from large values |
| Bias (Mean Error) | \( \text{Bias} = \frac{1}{n} \sum (P_i - O_i) \) | Indicates systematic over- or under-prediction |
| Percent Bias (PBIAS) | \( \text{PBIAS} = 100 \times \frac{ \sum (O_i - P_i) }{ \sum O_i } \) | Expresses bias as percentage; ideal value is 0 |


📘 Interpretation Guidelines#

| Metric | Ideal Value | Notes |
|---|---|---|
| RMSE, MAE | → 0 | Lower values indicate better fit |
| NSE, R² | → 1 | Higher values indicate better predictive power |
| Bias, PBIAS | → 0 | Bias > 0 = overprediction; note the PBIAS formula above reverses the sign, so PBIAS > 0 = underprediction |
| Log-RMSE | Contextual | Useful for log-transformed or multiplicative error models |


🧪 Use Cases#

| Scenario | Recommended Metric(s) |
|---|---|
| General error magnitude | RMSE, MAE |
| Hydrology / environmental models | NSE, PBIAS |
| Log-transformed or skewed data | Log-RMSE |
| Model bias detection | Bias, Percent Bias |
| Regression performance | R², RMSE |


โš ๏ธ Limitations#

  • RMSE is sensitive to large errors (outliers).

  • NSE can be misleading if observed variance is low.

  • Log-RMSE requires strictly positive values.

  • R² does not indicate bias or error magnitude.

import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider, IntSlider, Output

# Output widget
out = Output()

# Standard growth model (e.g., logistic-like)
def generate_observed_growth(n, r=0.2, K=100, P0=10):
    t = np.arange(n)
    observed = K / (1 + ((K - P0) / P0) * np.exp(-r * t))
    return t, observed

# Simulated growth model (user-controlled)
def generate_simulated_growth(n, r_sim=0.25, K_sim=95, P0_sim=12):
    t = np.arange(n)
    simulated = K_sim / (1 + ((K_sim - P0_sim) / P0_sim) * np.exp(-r_sim * t))
    return simulated

# Goodness-of-fit metrics
def compute_metrics(observed, predicted):
    eps = 1e-6
    mean_obs = np.mean(observed)
    residuals = observed - predicted

    rmse = np.sqrt(np.mean(residuals**2))
    mae = np.mean(np.abs(residuals))
    bias = np.mean(predicted - observed)
    pbias = 100 * np.sum(observed - predicted) / np.sum(observed)
    nse = 1 - np.sum(residuals**2) / np.sum((observed - mean_obs)**2)
    # With this definition, R² about the observed mean is identical to NSE
    r2 = nse
    log_rmse = np.sqrt(np.mean((np.log(observed + eps) - np.log(predicted + eps))**2))

    return {
        "RMSE": rmse,
        "MAE": mae,
        "Bias": bias,
        "PBIAS (%)": pbias,
        "NSE": nse,
        "Rยฒ": r2,
        "Log-RMSE": log_rmse
    }

# Interactive comparison
def compare_growth_models(n=30, r_sim=0.25, K_sim=95, P0_sim=12):
    t, observed = generate_observed_growth(n)
    simulated = generate_simulated_growth(n, r_sim, K_sim, P0_sim)
    metrics = compute_metrics(observed, simulated)

    out.clear_output()
    with out:
        print("๐Ÿ“˜ Goodness-of-Fit Metrics:")
        for k, v in metrics.items():
            print(f"{k}: {v:.4f}")

        print("\n๐Ÿ“— Interpretation:")
        print("โœ… Lower RMSE/MAE โ†’ better fit")
        print("โœ… NSE and Rยฒ close to 1 โ†’ strong predictive power")
        print("โœ… Bias and PBIAS near 0 โ†’ minimal systematic error")
        print("โœ… Log-RMSE useful for skewed or multiplicative data")

        # Plot comparison
        plt.figure(figsize=(10, 5))
        plt.plot(t, observed, label="Observed Growth", marker='o')
        plt.plot(t, simulated, label="Simulated Growth", marker='x')
        plt.title("Observed vs. Simulated Growth Model")
        plt.xlabel("Time")
        plt.ylabel("Population")
        plt.legend()
        plt.grid(True)
        plt.show()

# Interactive controls
interact(
    compare_growth_models,
    n=IntSlider(min=10, max=100, step=5, value=30, description="Time Steps"),
    r_sim=FloatSlider(min=0.1, max=0.5, step=0.01, value=0.25, description="Growth Rate"),
    K_sim=FloatSlider(min=50, max=150, step=5, value=95, description="Carrying Capacity"),
    P0_sim=FloatSlider(min=5, max=30, step=1, value=12, description="Initial Population")
)

out

🎲 Understanding Probability and Its Role in Risk Assessment#

📘 What Is Probability?#

Probability measures how likely an event is to occur, on a scale from 0 to 1:

  • 0 means the event is impossible.

  • 1 means the event is certain.

  • Values between 0 and 1 reflect varying degrees of uncertainty.

  • Probability is foundational to:

    • Statistics

    • Decision-making

    • Machine learning

    • Engineering risk analysis


๐Ÿ” Why Is Probability Important?#

| Domain | Role of Probability |
|---|---|
| Statistics | Models randomness in sampling, inference, and prediction. |
| Engineering | Assesses reliability, failure rates, and safety margins. |
| Computer Science | Powers algorithms in AI, cryptography, and simulations. |
| Finance | Quantifies market risk, portfolio volatility, and insurance models. |
| Environmental Science | Forecasts extreme events like floods, droughts, and pollutant dispersion. |


โš–๏ธ Probability in Risk Assessment#

Risk assessment involves estimating the likelihood and impact of adverse outcomes. Probability helps:

  • Model uncertainty in system behavior (e.g., soil failure, structural collapse).

  • Quantify exposure to hazards (e.g., earthquakes, contamination).

  • Support decision-making under uncertainty (e.g., cost-benefit analysis, safety thresholds).


🧪 Example: Geotechnical Risk#

In slope stability analysis:

  • Probability of failure is estimated using soil strength variability.

  • Monte Carlo simulations or probabilistic limit equilibrium methods are used.

  • Risk = Probability × Consequence (a toy Monte Carlo sketch follows)
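
A toy Monte Carlo sketch of this idea; all numbers (strength distribution, fixed demand, consequence cost) are assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

strength = rng.normal(loc=120.0, scale=20.0, size=n)  # assumed soil strength, kPa
demand = 100.0                                        # assumed fixed driving stress, kPa

fos = strength / demand                 # factor of safety per realization
p_failure = np.mean(fos < 1.0)          # P(FoS < 1)
consequence = 2e6                       # assumed loss per failure, $
print(f"P(failure) ≈ {p_failure:.4f}, risk ≈ ${p_failure * consequence:,.0f}")
```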


📊 Key Probability Concepts#

| Concept | Description |
|---|---|
| Experiment | A repeatable process with uncertain outcome (e.g., rolling a die). |
| Sample Space | The set of all possible outcomes. |
| Event | A subset of outcomes (e.g., rolling an even number). |
| Probability Rule | \(P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total outcomes}}\) |
| Complement Rule | \(P(\text{not } A) = 1 - P(A)\) |
| Addition Rule | \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\) |
| Multiplication Rule | \(P(A \cap B) = P(A) \cdot P(B \mid A)\) |
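
A short simulation verifying the complement and addition rules empirically with die rolls:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die

p_even = np.mean(rolls % 2 == 0)                   # A: even number
p_high = np.mean(rolls >= 5)                       # B: rolling 5 or 6
p_both = np.mean((rolls % 2 == 0) & (rolls >= 5))  # A and B: only 6

# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
print(np.mean((rolls % 2 == 0) | (rolls >= 5)), p_even + p_high - p_both)
# Complement rule: P(not A) = 1 - P(A)
print(np.mean(rolls % 2 == 1), 1 - p_even)
```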


🎯 Interactive Simulation: Empirical Probability and Return Period#

This simulation demonstrates how to estimate the empirical probability of an event exceeding a threshold and compute its return period using synthetic data.

๐Ÿ” What the Code Does#

  • Generates synthetic event magnitudes using a uniform distribution between val_min and val_max.

  • Allows the user to select a threshold value.

  • Calculates:

    • The number of events that exceed the threshold.

    • The empirical probability of exceedance:
      \(P(\text{event}) = \frac{\text{Number of exceedances}}{\text{Total events}}\)

    • The return period (or recurrence interval):
      \(\text{Return Period} = \frac{1}{P(\text{event})}\)

📈 Visualization#

  • A histogram shows the distribution of event magnitudes.

  • A red dashed line marks the selected threshold.

  • Summary statistics are printed to reinforce interpretation.

🧪 Applications#

This simulation is useful for exploring:

  • Flood frequency analysis

  • Earthquake magnitude exceedance

  • Slope failure probability

  • Any domain where event magnitude vs. threshold matters

โš™๏ธ Interactive Controls#

  • Sample Size: Number of synthetic events generated.

  • Min Value and Max Value: Range of event magnitudes.

  • Threshold: Value used to define a “critical” or “extreme” event.

📊 Example Use Case#

If you set:

  • val_min = 0, val_max = 100

  • threshold = 80

  • n = 100

The simulation will estimate how often events exceed 80 and how frequently such events are expected to recur.


The simulation below extends this idea to a synthetic time series with multiple thresholds for comparative risk analysis.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider, FloatSlider

# Generate synthetic time-series event magnitudes
def generate_time_series(n=100, val_min=0, val_max=100):
    np.random.seed(42)
    magnitudes = np.round(np.random.uniform(val_min, val_max, size=n), 2)
    time = np.arange(1, n + 1)
    return pd.DataFrame({'Time': time, 'Magnitude': magnitudes})

# Simulation with multiple thresholds
def multi_threshold_demo(n=100, val_min=0, val_max=100, t1=60, t2=80, t3=100):
    df = generate_time_series(n, val_min, val_max)
    thresholds = [t1, t2, t3]
    results = []

    for t in thresholds:
        exceed = df['Magnitude'] > t
        prob = exceed.sum() / n
        rp = 1 / prob if prob > 0 else np.inf
        results.append((t, prob, rp))
        df[f'> {t}'] = exceed

    # Display results
    print("๐Ÿ“Š Empirical Probability and Return Periods:")
    for t, p, r in results:
        print(f"Threshold {t:>3}:  Probability = {p:.4f},  Return Period = {r:.2f} events")

    # Plot time series
    plt.figure(figsize=(12, 5))
    plt.plot(df['Time'], df['Magnitude'], marker='o', linestyle='-', color='gray', label='Event Magnitude')
    for t in thresholds:
        plt.axhline(t, linestyle='--', label=f'Threshold {t}')
    plt.title("Synthetic Time-Series of Event Magnitudes")
    plt.xlabel("Time")
    plt.ylabel("Magnitude")
    plt.legend()
    plt.grid(True)
    plt.show()

# Interactive widget
interact(multi_threshold_demo,
         n=IntSlider(min=50, max=500, step=50, value=100, description='Sample Size'),
         val_min=IntSlider(min=0, max=50, step=5, value=0, description='Min Value'),
         val_max=IntSlider(min=50, max=150, step=10, value=100, description='Max Value'),
         t1=FloatSlider(min=0, max=150, step=1, value=60, description='Threshold 1'),
         t2=FloatSlider(min=0, max=150, step=1, value=80, description='Threshold 2'),
         t3=FloatSlider(min=0, max=150, step=1, value=100, description='Threshold 3'));

📊 Using Theoretical Distributions in Risk and Probability Analysis#

When empirical data is scarce, unreliable, or incomplete, we turn to theoretical probability distributions to model uncertainty and estimate risk. These distributions are based on mathematical assumptions and are widely used in engineering, environmental science, and decision analysis.


📘 Why Use Theoretical Distributions?#

  • Empirical distributions rely on observed data, which may be:

    • Too sparse (e.g., rare events like major earthquakes)

    • Incomplete (e.g., short monitoring periods)

    • Biased or noisy

  • Theoretical distributions provide:

    • A structured way to model uncertainty

    • Flexibility to simulate rare or extreme events

    • A foundation for probabilistic design and risk assessment


๐Ÿ“ Common Theoretical Distributions#

| Distribution | Typical Use Case | Shape Characteristics |
|---|---|---|
| Normal | Measurement errors, material properties | Symmetric, bell-shaped |
| Lognormal | Soil permeability, rainfall intensity | Skewed right, non-negative |
| Exponential | Time between failures, hazard occurrence | Memoryless, decreasing tail |
| Weibull | Reliability, fatigue life | Flexible shape for increasing/decreasing hazard |
| Gumbel | Extreme value modeling (e.g., floods, wind) | Models maxima/minima of distributions |
| Binomial | Discrete event counts (e.g., success/failure) | Finite trials, fixed probability |
| Poisson | Rare event frequency (e.g., landslides/year) | Count-based, low probability |


โš–๏ธ Application in Risk Assessment#

Theoretical distributions help estimate:

  • Probability of exceedance:
    \(P(X > x)\) for a critical threshold \(x\)

  • Return period:
    \(\text{Return Period} = \frac{1}{P(\text{event})}\)

  • Confidence intervals for uncertain parameters

  • Failure probabilities in reliability models


🧪 Example: Flood Risk Using Gumbel Distribution#

If historical flood data is limited, we can fit a Gumbel distribution to estimate the following (a sketch with synthetic data appears after the list):

  • Probability of a flood exceeding 500 m³/s

  • Expected return period of such an event

  • Design thresholds for levees or drainage systems
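
A hedged sketch of that workflow using `scipy.stats.gumbel_r` on synthetic annual peaks; a real analysis would substitute observed flow data:

```python
import numpy as np
from scipy import stats

# Synthetic 40-year record of annual peak flows (m³/s); parameters are assumed
peaks = stats.gumbel_r.rvs(loc=300, scale=80, size=40, random_state=7)

loc, scale = stats.gumbel_r.fit(peaks)          # fit Gumbel to the record
p_exceed = stats.gumbel_r.sf(500, loc, scale)   # P(X > 500 m³/s)
print(f"P(X > 500) ≈ {p_exceed:.4f}, return period ≈ {1 / p_exceed:.1f} years")
```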


🧠 Key Takeaways#

  • Theoretical distributions are essential when empirical data is insufficient.

  • Choice of distribution depends on the nature of the variable (continuous vs. discrete, symmetric vs. skewed).

  • They enable simulation, forecasting, and probabilistic design under uncertainty.


Distribution Fitting and Goodness of Fit#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from IPython.display import display  # renders the summary DataFrame below
from ipywidgets import interact, SelectMultiple, FloatSlider, IntSlider, Button, VBox, HBox, Output

# Output widget for dynamic display
out = Output()

# Available distributions
available_distributions = {
    "Normal": stats.norm,
    "Lognormal": stats.lognorm,
    "Exponential": stats.expon,
    "Weibull": stats.weibull_min
}

# Global data container
data = None

# Simulation function
def simulate_data(n=500, shape=2.0, scale=10.0):
    global data
    np.random.seed(42)
    data = np.random.gamma(shape=shape, scale=scale, size=n)

# Fit and analyze selected distributions
def analyze(selected, threshold):
    out.clear_output()
    with out:
        if data is None:
            print("โš ๏ธ Please click 'Simulate' first.")
            return

        x = np.linspace(min(data), max(data), 100)
        fit_results = []

        # Empirical exceedance
        exceed_emp = data > threshold
        prob_emp = exceed_emp.sum() / len(data)
        rp_emp = 1 / prob_emp if prob_emp > 0 else np.inf

        print(f"๐Ÿ“Š Empirical Probability of Exceeding {threshold:.2f}: {prob_emp:.4f}")
        print(f"๐Ÿ” Empirical Return Period: {rp_emp:.2f} events\n")

        for name in selected:
            dist = available_distributions[name]
            params = dist.fit(data)
            pdf_fitted = dist.pdf(x, *params)
            ks_stat, ks_p = stats.kstest(data, dist.name, args=params)
            log_likelihood = np.sum(dist.logpdf(data, *params))
            k = len(params)
            aic = 2 * k - 2 * log_likelihood

            # Empirical 2.5th–97.5th percentile range of the data
            # (a rough interval, not a bootstrap CI on the fitted parameters)
            ci_lower = np.percentile(data, 2.5)
            ci_upper = np.percentile(data, 97.5)

            # Theoretical exceedance
            prob_theo = dist.sf(threshold, *params)
            rp_theo = 1 / prob_theo if prob_theo > 0 else np.inf

            fit_results.append({
                "Distribution": name,
                "KS Statistic": round(ks_stat, 4),
                "KS p-value": round(ks_p, 4),
                "AIC": round(aic, 2),
                "95% CI": f"[{ci_lower:.2f}, {ci_upper:.2f}]",
                "P(X > threshold)": round(prob_theo, 4),
                "Return Period": round(rp_theo, 2)
            })

            # Plot
            plt.figure(figsize=(6, 4))
            plt.hist(data, bins=30, density=True, alpha=0.5, label="Empirical")
            plt.plot(x, pdf_fitted, label=f"{name} Fit", linewidth=2)
            plt.axvline(threshold, color='red', linestyle='--', label=f"Threshold = {threshold}")
            plt.title(f"{name} Distribution Fit")
            plt.xlabel("Value")
            plt.ylabel("Density")
            plt.legend()
            plt.grid(True)
            plt.show()

        # Summary table
        df_results = pd.DataFrame(fit_results)
        print("๐Ÿ“Š Goodness-of-Fit and Risk Summary:")
        display(df_results.sort_values("AIC"))

# Widgets
simulate_btn = Button(description="Simulate", button_style='success')
simulate_btn.on_click(lambda b: simulate_data())

distribution_selector = SelectMultiple(
    options=list(available_distributions.keys()),
    value=["Normal", "Lognormal"],
    description="Distributions"
)

threshold_slider = FloatSlider(
    value=80,
    min=0,
    max=200,
    step=1,
    description="Threshold"
)

analyze_btn = Button(description="Analyze", button_style='info')
analyze_btn.on_click(lambda b: analyze(distribution_selector.value, threshold_slider.value))

# Layout
VBox([
    HBox([simulate_btn, analyze_btn]),
    distribution_selector,
    threshold_slider,
    out
])

🧪 Hypothesis Testing: Concepts, Types, and Applications#

Hypothesis testing is a statistical method used to make decisions or inferences about population parameters based on sample data.


🎯 Core Concepts#

  • Null Hypothesis (\(H_0\)): Assumes no effect or no difference.

  • Alternative Hypothesis (\(H_1\) or \(H_a\)): Assumes a real effect or difference exists.

  • Test Statistic: A value computed from sample data used to decide whether to reject \(H_0\).

  • p-value: Probability, under \(H_0\), of observing a test statistic at least as extreme as the one computed.

  • Significance Level (\(\alpha\)): Threshold for rejecting \(H_0\), commonly 0.05.

  • Confidence Interval: Range of values likely to contain the true parameter.


๐Ÿ” Types of Hypothesis Tests#

| Test Type | Use Case | Assumptions | Example Scenario |
|---|---|---|---|
| Z-test | Compare means (large \(n\), known \(\sigma\)) | Normal distribution, known variance | Mean weight of manufactured parts |
| t-test | Compare means (small \(n\), unknown \(\sigma\)) | Normality, equal variance (for 2-sample) | Drug effectiveness between two groups |
| Chi-square test | Test for independence or goodness-of-fit | Categorical data, expected counts ≥ 5 | Survey response vs. age group |
| ANOVA | Compare means across ≥ 3 groups | Normality, equal variance | Crop yield across different fertilizers |
| Mann–Whitney U | Non-parametric test for two independent groups | No assumption of normality | Median income between regions |
| Wilcoxon Signed-Rank | Non-parametric test for paired samples | Symmetric distribution of differences | Before/after treatment scores |
| Kolmogorov–Smirnov | Compare distributions or test normality | Continuous data | Fit of theoretical distribution |
| Log-rank test | Compare survival curves | Censored data, proportional hazards | Time-to-failure in reliability studies |


📊 Decision Process#

  1. Formulate hypotheses: \(H_0\) and \(H_1\)

  2. Choose significance level: \(\alpha = 0.05\)

  3. Select appropriate test based on data type and assumptions

  4. Compute test statistic and p-value

  5. Compare p-value to \(\alpha\):

    • If \(p < \alpha\): Reject \(H_0\)

    • If \(p \geq \alpha\): Fail to reject \(H_0\) (a minimal worked example follows)
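
A minimal one-sample example of the full decision process, on hypothetical data with \(H_0: \mu = 50\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(52, 10, 25)  # hypothetical measurements

alpha = 0.05                                             # step 2: significance level
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)  # steps 3-4: test statistic and p-value
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("Reject H₀" if p_value < alpha else "Fail to reject H₀")  # step 5: decision
```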


🧠 Applications in Engineering and Science#

  • Quality control: Z-test, t-test for process monitoring

  • Environmental analysis: ANOVA for comparing pollutant levels

  • Reliability: Log-rank test for component lifetimes

  • Geotechnical studies: Non-parametric tests for soil properties

  • Remote sensing: Chi-square for classification accuracy


The interactive example below simulates two groups and runs a two-sample t-test.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind
from ipywidgets import interact, IntSlider, FloatSlider, Button, Output, VBox, HBox

# Output widget
out = Output()

# Global data
group1 = None
group2 = None

# Simulate data
def simulate_ttest(n=30, mu1=50, mu2=55, sigma=10):
    global group1, group2
    np.random.seed(42)
    group1 = np.random.normal(loc=mu1, scale=sigma, size=n)
    group2 = np.random.normal(loc=mu2, scale=sigma, size=n)

# Run t-test and plot
def run_ttest():
    out.clear_output()
    with out:
        if group1 is None or group2 is None:
            print("โš ๏ธ Please simulate data first.")
            return

        # Perform t-test
        stat, p = ttest_ind(group1, group2)
        mean1, mean2 = np.mean(group1), np.mean(group2)

        # Plot
        plt.figure(figsize=(6, 4))
        plt.hist(group1, bins=20, alpha=0.6, label="Group 1", color='blue')
        plt.hist(group2, bins=20, alpha=0.6, label="Group 2", color='orange')
        plt.axvline(mean1, color='blue', linestyle='--', label=f"Mean 1 = {mean1:.2f}")
        plt.axvline(mean2, color='orange', linestyle='--', label=f"Mean 2 = {mean2:.2f}")
        plt.title("Distribution of Two Groups")
        plt.xlabel("Value")
        plt.ylabel("Frequency")
        plt.legend()
        plt.grid(True)
        plt.show()

        # Summary
        print("๐Ÿ“Š Two-Sample t-Test Results:")
        print(f"t-statistic: {stat:.4f}")
        print(f"p-value: {p:.4f}")
        if p < 0.05:
            print("โœ… Significant difference (reject Hโ‚€)")
        else:
            print("โŒ No significant difference (fail to reject Hโ‚€)")

# Widgets
simulate_btn = Button(description="Simulate", button_style='success')
simulate_btn.on_click(lambda b: simulate_ttest(n_slider.value, mu1_slider.value, mu2_slider.value))

run_btn = Button(description="Run t-Test", button_style='info')
run_btn.on_click(lambda b: run_ttest())

n_slider = IntSlider(value=30, min=10, max=100, step=5, description="Sample Size")
mu1_slider = FloatSlider(value=50, min=30, max=70, step=1, description="Mean Group 1")
mu2_slider = FloatSlider(value=55, min=30, max=70, step=1, description="Mean Group 2")

# Layout
VBox([
    HBox([simulate_btn, run_btn]),
    n_slider,
    mu1_slider,
    mu2_slider,
    out
])

Hypothesis Testing#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from ipywidgets import Dropdown, IntSlider, FloatSlider, Button, Output, VBox, HBox

# Output widget
out = Output()

# Global data containers
data_store = {}

# Test options
test_options = [
    "Z-test",
    "t-test",
    "Chi-square test",
    "ANOVA",
    "Mannโ€“Whitney U",
    "Wilcoxon Signed-Rank",
    "Kolmogorovโ€“Smirnov",
    "Log-rank test (simulated)"
]

# Simulate data
def simulate_data(n, mu1, mu2):
    np.random.seed(42)
    data_store.clear()
    data_store["x1"] = np.random.normal(mu1, 10, n)
    data_store["x2"] = np.random.normal(mu2, 10, n)
    data_store["before"] = np.random.normal(mu1, 10, n)
    data_store["after"] = data_store["before"] + np.random.normal(mu2 - mu1, 5, n)
    data_store["anova"] = [
        np.random.normal(50, 10, n),
        np.random.normal(55, 10, n),
        np.random.normal(60, 10, n)
    ]
    data_store["survival1"] = np.random.exponential(scale=10, size=n)
    data_store["survival2"] = np.random.exponential(scale=15, size=n)

# Analyze selected test
def analyze_test(test_type):
    out.clear_output()
    with out:
        if not data_store:
            print("โš ๏ธ Please click 'Simulate' first.")
            return

        def interpret(p, alpha=0.05):
            if p < alpha:
                return f"✅ Result is statistically significant (p < {alpha}). Reject the null hypothesis."
            else:
                return f"❌ Result is not statistically significant (p ≥ {alpha}). Fail to reject the null hypothesis."

        if test_type == "Z-test":
            mu = 10
            x1, x2 = data_store["x1"], data_store["x2"]
            z = (np.mean(x1) - np.mean(x2)) / np.sqrt(mu**2/len(x1) + mu**2/len(x2))
            p = 2 * (1 - stats.norm.cdf(abs(z)))
            print(f"Z-test: z = {z:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "t-test":
            stat, p = stats.ttest_ind(data_store["x1"], data_store["x2"])
            print(f"t-test: t = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "Chi-square test":
            observed = np.array([[20, 30], [25, 25]])
            stat, p, _, _ = stats.chi2_contingency(observed)
            print(f"Chi-square test: ฯ‡ยฒ = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "ANOVA":
            stat, p = stats.f_oneway(*data_store["anova"])
            print(f"ANOVA: F = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "Mannโ€“Whitney U":
            stat, p = stats.mannwhitneyu(data_store["x1"], data_store["x2"])
            print(f"Mannโ€“Whitney U: U = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "Wilcoxon Signed-Rank":
            stat, p = stats.wilcoxon(data_store["before"], data_store["after"])
            print(f"Wilcoxon Signed-Rank: W = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "Kolmogorovโ€“Smirnov":
            x = data_store["x1"]
            stat, p = stats.kstest(x, 'norm', args=(np.mean(x), np.std(x)))
            print(f"Kolmogorovโ€“Smirnov: D = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        elif test_type == "Log-rank test (simulated)":
            stat, p = stats.ttest_ind(data_store["survival1"], data_store["survival2"])
            print(f"Log-rank (simulated): t = {stat:.4f}, p = {p:.4f}")
            print(interpret(p))

        # Optional plot
        plt.figure(figsize=(6, 4))
        if test_type in ["Z-test", "t-test", "Mann–Whitney U"]:
            plt.hist(data_store["x1"], bins=20, alpha=0.6, label="Group 1")
            plt.hist(data_store["x2"], bins=20, alpha=0.6, label="Group 2")
        elif test_type == "Wilcoxon Signed-Rank":
            plt.plot(data_store["before"], label="Before", marker='o')
            plt.plot(data_store["after"], label="After", marker='x')
        elif test_type == "ANOVA":
            plt.boxplot(data_store["anova"], labels=["Group 1", "Group 2", "Group 3"])
        elif test_type == "Kolmogorovโ€“Smirnov":
            x = data_store["x1"]
            plt.hist(x, bins=20, alpha=0.7, label="Sample")
            plt.plot(np.sort(x), stats.norm.pdf(np.sort(x), np.mean(x), np.std(x)), label="Normal PDF")
        elif test_type == "Log-rank test (simulated)":
            plt.hist(data_store["survival1"], bins=20, alpha=0.6, label="Group 1")
            plt.hist(data_store["survival2"], bins=20, alpha=0.6, label="Group 2")
        plt.title(f"{test_type} Visualization")
        plt.legend()
        plt.grid(True)
        plt.show()


# Widgets
test_selector = Dropdown(options=test_options, value="t-test", description="Test Type")
n_slider = IntSlider(value=30, min=10, max=100, step=5, description="Sample Size")
mu1_slider = FloatSlider(value=50, min=30, max=70, step=1, description="Mean Group 1")
mu2_slider = FloatSlider(value=55, min=30, max=70, step=1, description="Mean Group 2")

simulate_btn = Button(description="Simulate", button_style='success')
simulate_btn.on_click(lambda b: simulate_data(n_slider.value, mu1_slider.value, mu2_slider.value))

analyze_btn = Button(description="Analyze", button_style='info')
analyze_btn.on_click(lambda b: analyze_test(test_selector.value))

# Layout
VBox([
    HBox([simulate_btn, analyze_btn]),
    test_selector,
    n_slider,
    mu1_slider,
    mu2_slider,
    out
])