๐ Sampling and Descriptive Statistics#
Understanding how to summarize and interpret data is foundational in statistics. This module introduces key concepts in sampling, population vs. sample statistics, and types of averages.
๐ฏ Population vs. Sample Statistics#
Concept |
Population (Parameter) |
Sample (Statistic) |
---|---|---|
Mean |
\( \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \) |
\( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \) |
Variance |
\( \sigma^2 = \frac{1}{N} \sum (x_i - \mu)^2 \) |
\( s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 \) |
Standard Deviation |
\( \sigma \) |
\( s \) |
Proportion |
\( P \) |
\( \hat{p} \) |
Population: Entire group of interest
Sample: Subset used to estimate population characteristics
๐งช Sampling Methods#
Simple Random Sampling: Every member has equal chance
Stratified Sampling: Divide into subgroups and sample proportionally
Systematic Sampling: Select every \(k^{th}\) item
Cluster Sampling: Randomly select entire groups
๐ Descriptive Statistics#
Descriptive statistics summarize data using:
Measures of Central Tendency: Mean, Median, Mode
Measures of Dispersion: Range, Variance, Standard Deviation
Shape: Skewness, Kurtosis
Position: Percentiles, Quartiles
๐ Comparison of Mean Types#
Type of Mean |
Formula |
Use Case |
---|---|---|
Arithmetic Mean |
\( \bar{x} = \frac{1}{n} \sum x_i \) |
General average |
Geometric Mean |
\( \left( \prod x_i \right)^{1/n} \) |
Growth rates, ratios |
Harmonic Mean |
\( \frac{n}{\sum \frac{1}{x_i}} \) |
Rates, speeds |
Median |
Middle value (sorted data) |
Skewed distributions |
Mode |
Most frequent value |
Categorical or discrete data |
โ Summary#
Sampling allows inference from limited data
Descriptive statistics summarize key features
Choosing the right average depends on context
Sample statistics use \(n-1\) in variance to reduce bias
Population statistics assume full data and use \(n\)
# ๐ Interactive Statistics Module (Unweighted, Row-wise Display)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider, IntSlider
# Synthetic dataset generator
def generate_data(n=30, mean=50, std=10):
np.random.seed(42)
data = np.random.normal(loc=mean, scale=std, size=n)
return np.round(data, 4) # Round to 4 significant digits
# Statistics calculator
def compute_statistics(n=30, mean=50, std=10):
x = generate_data(n, mean, std)
# Sample statistics (ddof=1)
mean_sample = np.mean(x)
var_sample = np.var(x, ddof=1)
std_sample = np.std(x, ddof=1)
# Population statistics (ddof=0)
var_pop = np.var(x, ddof=0)
std_pop = np.std(x, ddof=0)
# Display statistics
print("๐ Sample Statistics:")
print(f"Mean: {mean_sample:.4f}")
print(f"Variance (Sample): {var_sample:.4f}")
print(f"Standard Deviation (Sample): {std_sample:.4f}\n")
print("๐ Population Statistics:")
print(f"Variance (Population): {var_pop:.4f}")
print(f"Standard Deviation (Population): {std_pop:.4f}\n")
print("๐ Sample Data (25 values per row):")
for i in range(0, len(x), 25):
row = x[i:i+25]
formatted = " ".join(f"{val:.2f}" for val in row)
print(formatted)
# Plot distribution
plt.figure(figsize=(10, 5))
plt.hist(x, bins=10, alpha=0.7, color='skyblue', edgecolor='black')
plt.axvline(mean_sample, color='blue', linestyle='--', label='Sample Mean')
plt.title("Distribution of Synthetic Sample Data")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.legend()
plt.grid(True)
plt.show()
# Interactive widget
interact(compute_statistics,
n=IntSlider(min=10, max=100, step=10, value=30, description='Sample Size'),
mean=FloatSlider(min=30, max=70, step=1, value=50, description='Mean'),
std=FloatSlider(min=5, max=20, step=1, value=10, description='Std Dev'));
โ๏ธ Types of Weighted Means: Concepts, Applications, and Limitations#
Weighted means are used when data points contribute unequally to an average. This module explores different types of weighted means, their formulas, use cases, and potential drawbacks.
๐ What Is a Weighted Mean?#
A weighted mean adjusts each valueโs contribution based on its assigned weight:
Where:
\(x_i\) are the data values
\(w_i\) are the corresponding weights
๐งฎ Types of Weighted Means#
Type |
Formula |
Application Example |
Limitations |
---|---|---|---|
Arithmetic Weighted Mean |
$\( \bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} \)$ |
GPA, average cost, survey scores |
Sensitive to outliers and large weights |
Geometric Weighted Mean |
$\( \bar{x}_g = \prod x_i^{w_i / \sum w_i} \)$ |
Growth rates, portfolio returns |
Cannot handle zero or negative values |
Harmonic Weighted Mean |
$\( \bar{x}_h = \frac{\sum w_i}{\sum \frac{w_i}{x_i}} \)$ |
Average speed, rates, efficiency |
Undefined for \(x_i = 0\); sensitive to small values |
Quadratic Mean (RMS) |
$\( \bar{x}_{rms} = \sqrt{ \frac{\sum w_i x_i^2}{\sum w_i} } \)$ |
Signal processing, error analysis |
Overemphasizes large values |
๐ Detailed Descriptions#
1. Arithmetic Weighted Mean#
Most common form
Each value contributes proportionally to its weight
Used in education (GPA), economics (weighted price index), and polling
2. Geometric Weighted Mean#
Multiplies values raised to weight fractions
Ideal for compounding scenarios (e.g., investment returns)
Cannot be used with zero or negative values
3. Harmonic Weighted Mean#
Inverse of weighted average of reciprocals
Best for averaging rates (e.g., speed over varying distances)
Sensitive to very small values and undefined for zero
4. Quadratic Mean (Root Mean Square)#
Square values before averaging
Common in physics and engineering (e.g., RMS voltage)
Amplifies large deviations
โ ๏ธ Summary of Limitations#
Arithmetic: Skewed by extreme weights or outliers
Geometric: Requires all values \(> 0\)
Harmonic: Undefined for zero, unstable for small values
Quadratic: Not robust to outliers, emphasizes large values
โ Choosing the Right Mean#
Scenario |
Recommended Mean |
---|---|
Weighted average of scores |
Arithmetic Weighted Mean |
Compounded growth |
Geometric Weighted Mean |
Averaging rates or speeds |
Harmonic Weighted Mean |
Signal or error magnitude |
Quadratic Mean (RMS) |
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider
# Generate synthetic values
def generate_values(n=30, val_min=10, val_max=100):
np.random.seed(42)
return np.round(np.random.uniform(val_min, val_max, size=n), 4)
# Normalize weights
def normalize(weights):
return weights / np.sum(weights)
# Weighting schemes
def unweighted_weights(n):
return np.ones(n) / n
def random_weights(n, weight_min=1, weight_max=5):
raw = np.random.randint(weight_min, weight_max + 1, size=n)
return normalize(raw)
def center_weights(values, val_min, val_max):
midpoint = (val_min + val_max) / 2
distances = np.abs(values - midpoint)
raw = 1 / (distances + 1e-3)
return normalize(raw)
# Comparison demo
def compare_all_weights(n=30, val_min=10, val_max=100):
values = generate_values(n, val_min, val_max)
w_unweighted = unweighted_weights(n)
w_random = random_weights(n)
w_center = center_weights(values, val_min, val_max)
mean_unweighted = np.average(values, weights=w_unweighted)
mean_random = np.average(values, weights=w_random)
mean_center = np.average(values, weights=w_center)
# Display table
df = pd.DataFrame({
'Value': values,
'Unweighted': np.round(w_unweighted, 4),
'Random Weight': np.round(w_random, 4),
'Center Weight': np.round(w_center, 4)
})
print(df)
print(f"\nโ๏ธ Mean Comparisons:")
print(f"๐น Unweighted Mean: {mean_unweighted:.4f}")
print(f"๐น Random Weighted Mean: {mean_random:.4f}")
print(f"๐น Center-Weighted Mean: {mean_center:.4f}")
# Plot
plt.figure(figsize=(12, 6))
plt.scatter(values, np.zeros_like(values), s=w_unweighted * 1000, c='gray', alpha=0.4, label='Unweighted')
plt.scatter(values, np.ones_like(values), s=w_random * 1000, c='blue', alpha=0.5, label='Random Weights')
plt.scatter(values, np.ones_like(values) * 2, s=w_center * 1000, c='orange', alpha=0.5, label='Center Weights')
plt.axvline(mean_unweighted, color='gray', linestyle='--', label='Unweighted Mean')
plt.axvline(mean_random, color='blue', linestyle='--', label='Random Mean')
plt.axvline(mean_center, color='orange', linestyle='--', label='Center Mean')
plt.yticks([0, 1, 2], ['Unweighted', 'Random', 'Center'])
plt.title("Comparison of Weighting Schemes (Normalized)")
plt.xlabel("Value")
plt.legend()
plt.grid(True)
plt.show()
# Interactive widget
interact(compare_all_weights,
n=IntSlider(min=10, max=100, step=10, value=30, description='Sample Size'),
val_min=IntSlider(min=0, max=50, step=5, value=10, description='Min Value'),
val_max=IntSlider(min=50, max=150, step=10, value=100, description='Max Value'));
๐ Goodness-of-Fit Measures: Overview and Interpretation#
Goodness-of-fit metrics quantify how well a model or simulation replicates observed data. These measures help evaluate prediction accuracy, error magnitude, and overall model reliability.
๐ข Common Goodness-of-Fit Metrics#
Measure |
Formula |
Interpretation |
---|---|---|
RMSE (Root Mean Square Error) |
$\( \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (O_i - P_i)^2 } \)$ |
Penalizes large errors; lower is better |
MAE (Mean Absolute Error) |
$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |
O_i - P_i |
NSE (NashโSutcliffe Efficiency) |
$\( \text{NSE} = 1 - \frac{ \sum (O_i - P_i)^2 }{ \sum (O_i - \bar{O})^2 } \)$ |
Compares model to mean of observed; 1 is perfect, 0 is mean model |
Rยฒ (Coefficient of Determination) |
$\( R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} \)$ |
Proportion of variance explained; closer to 1 is better |
Log-Normalized RMSE |
$\( \text{Log-RMSE} = \sqrt{ \frac{1}{n} \sum (\log O_i - \log P_i)^2 } \)$ |
Useful when data spans orders of magnitude; reduces skew from large values |
Bias (Mean Error) |
$\( \text{Bias} = \frac{1}{n} \sum (P_i - O_i) \)$ |
Indicates systematic over- or under-prediction |
Percent Bias (PBIAS) |
$\( \text{PBIAS} = 100 \times \frac{ \sum (O_i - P_i) }{ \sum O_i } \)$ |
Expresses bias as percentage; ideal value is 0 |
๐ Interpretation Guidelines#
Metric |
Ideal Value |
Notes |
---|---|---|
RMSE, MAE |
โ 0 |
Lower values indicate better fit |
NSE, Rยฒ |
โ 1 |
Higher values indicate better predictive power |
Bias, PBIAS |
โ 0 |
Positive = overprediction; Negative = underprediction |
Log-RMSE |
Contextual |
Useful for log-transformed or multiplicative error models |
๐งช Use Cases#
Scenario |
Recommended Metric(s) |
---|---|
General error magnitude |
RMSE, MAE |
Hydrology / environmental models |
NSE, PBIAS |
Log-transformed or skewed data |
Log-RMSE |
Model bias detection |
Bias, Percent Bias |
Regression performance |
Rยฒ, RMSE |
โ ๏ธ Limitations#
RMSE is sensitive to large errors (outliers).
NSE can be misleading if observed variance is low.
Log-RMSE requires strictly positive values.
Rยฒ does not indicate bias or error magnitude.
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider, IntSlider, Output
# Output widget
out = Output()
# Standard growth model (e.g., logistic-like)
def generate_observed_growth(n, r=0.2, K=100, P0=10):
t = np.arange(n)
observed = K / (1 + ((K - P0) / P0) * np.exp(-r * t))
return t, observed
# Simulated growth model (user-controlled)
def generate_simulated_growth(n, r_sim=0.25, K_sim=95, P0_sim=12):
t = np.arange(n)
simulated = K_sim / (1 + ((K_sim - P0_sim) / P0_sim) * np.exp(-r_sim * t))
return simulated
# Goodness-of-fit metrics
def compute_metrics(observed, predicted):
eps = 1e-6
mean_obs = np.mean(observed)
residuals = observed - predicted
rmse = np.sqrt(np.mean(residuals**2))
mae = np.mean(np.abs(residuals))
bias = np.mean(predicted - observed)
pbias = 100 * np.sum(observed - predicted) / np.sum(observed)
nse = 1 - np.sum(residuals**2) / np.sum((observed - mean_obs)**2)
r2 = 1 - np.sum(residuals**2) / np.sum((observed - mean_obs)**2)
log_rmse = np.sqrt(np.mean((np.log(observed + eps) - np.log(predicted + eps))**2))
return {
"RMSE": rmse,
"MAE": mae,
"Bias": bias,
"PBIAS (%)": pbias,
"NSE": nse,
"Rยฒ": r2,
"Log-RMSE": log_rmse
}
# Interactive comparison
def compare_growth_models(n=30, r_sim=0.25, K_sim=95, P0_sim=12):
t, observed = generate_observed_growth(n)
simulated = generate_simulated_growth(n, r_sim, K_sim, P0_sim)
metrics = compute_metrics(observed, simulated)
out.clear_output()
with out:
print("๐ Goodness-of-Fit Metrics:")
for k, v in metrics.items():
print(f"{k}: {v:.4f}")
print("\n๐ Interpretation:")
print("โ
Lower RMSE/MAE โ better fit")
print("โ
NSE and Rยฒ close to 1 โ strong predictive power")
print("โ
Bias and PBIAS near 0 โ minimal systematic error")
print("โ
Log-RMSE useful for skewed or multiplicative data")
# Plot comparison
plt.figure(figsize=(10, 5))
plt.plot(t, observed, label="Observed Growth", marker='o')
plt.plot(t, simulated, label="Simulated Growth", marker='x')
plt.title("Observed vs. Simulated Growth Model")
plt.xlabel("Time")
plt.ylabel("Population")
plt.legend()
plt.grid(True)
plt.show()
# Interactive controls
interact(
compare_growth_models,
n=IntSlider(min=10, max=100, step=5, value=30, description="Time Steps"),
r_sim=FloatSlider(min=0.1, max=0.5, step=0.01, value=0.25, description="Growth Rate"),
K_sim=FloatSlider(min=50, max=150, step=5, value=95, description="Carrying Capacity"),
P0_sim=FloatSlider(min=5, max=30, step=1, value=12, description="Initial Population")
)
out
๐ฒ Understanding Probability and Its Role in Risk Assessment#
๐ What Is Probability?#
1 means the event is certain.
Values between 0 and 1 reflect varying degrees of uncertainty.
Probability is foundational to:
Statistics
Decision-making
Machine learning
Engineering risk analysis
๐ Why Is Probability Important?#
Domain |
Role of Probability |
---|---|
Statistics |
Models randomness in sampling, inference, and prediction. |
Engineering |
Assesses reliability, failure rates, and safety margins. |
Computer Science |
Powers algorithms in AI, cryptography, and simulations. |
Finance |
Quantifies market risk, portfolio volatility, and insurance models. |
Environmental Science |
Forecasts extreme events like floods, droughts, and pollutant dispersion. |
โ๏ธ Probability in Risk Assessment#
Risk assessment involves estimating the likelihood and impact of adverse outcomes. Probability helps:
Model uncertainty in system behavior (e.g., soil failure, structural collapse).
Quantify exposure to hazards (e.g., earthquakes, contamination).
Support decision-making under uncertainty (e.g., cost-benefit analysis, safety thresholds).
๐งช Example: Geotechnical Risk#
In slope stability analysis:
Probability of failure is estimated using soil strength variability.
Monte Carlo simulations or probabilistic limit equilibrium methods are used.
Risk = Probability ร Consequence
๐ Key Probability Concepts#
Concept |
Description |
---|---|
Experiment |
A repeatable process with uncertain outcome (e.g., rolling a die). |
Sample Space |
The set of all possible outcomes. |
Event |
A subset of outcomes (e.g., rolling an even number). |
Probability Rule |
\(P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total outcomes}}\) |
Complement Rule |
\(P(\text{not } A) = 1 - P(A)\) |
Addition Rule |
\(P(A \cup B) = P(A) + P(B) - P(A \cap B)\) |
Multiplication Rule |
$P(A \cap B) = P(A) \cdot P(B |
๐ฏ Interactive Simulation: Empirical Probability and Return Period#
This simulation demonstrates how to estimate the empirical probability of an event exceeding a threshold and compute its return period using synthetic data.
๐ What the Code Does#
Generates synthetic event magnitudes using a uniform distribution between
val_min
andval_max
.Allows the user to select a threshold value.
Calculates:
The number of events that exceed the threshold.
The empirical probability of exceedance:
\(P(\text{event}) = \frac{\text{Number of exceedances}}{\text{Total events}}\)The return period (or recurrence interval):
\(\text{Return Period} = \frac{1}{P(\text{event})}\)
๐ Visualization#
A histogram shows the distribution of event magnitudes.
A red dashed line marks the selected threshold.
Summary statistics are printed to reinforce interpretation.
๐งช Applications#
This simulation is useful for exploring:
Flood frequency analysis
Earthquake magnitude exceedance
Slope failure probability
Any domain where event magnitude vs. threshold matters
โ๏ธ Interactive Controls#
Sample Size
: Number of synthetic events generated.Min Value
andMax Value
: Range of event magnitudes.Threshold
: Value used to define a โcriticalโ or โextremeโ event.
๐ Example Use Case#
If you set:
val_min = 0
,val_max = 100
threshold = 80
n = 100
The simulation will estimate how often events exceed 80 and how frequently such events are expected to recur.
Would you like to extend this with time-series simulation or allow multiple thresholds for comparative risk analysis?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider, FloatSlider
# Generate synthetic time-series event magnitudes
def generate_time_series(n=100, val_min=0, val_max=100):
np.random.seed(42)
magnitudes = np.round(np.random.uniform(val_min, val_max, size=n), 2)
time = np.arange(1, n + 1)
return pd.DataFrame({'Time': time, 'Magnitude': magnitudes})
# Simulation with multiple thresholds
def multi_threshold_demo(n=100, val_min=0, val_max=100, t1=60, t2=80, t3=100):
df = generate_time_series(n, val_min, val_max)
thresholds = [t1, t2, t3]
results = []
for t in thresholds:
exceed = df['Magnitude'] > t
prob = exceed.sum() / n
rp = 1 / prob if prob > 0 else np.inf
results.append((t, prob, rp))
df[f'> {t}'] = exceed
# Display results
print("๐ Empirical Probability and Return Periods:")
for t, p, r in results:
print(f"Threshold {t:>3}: Probability = {p:.4f}, Return Period = {r:.2f} events")
# Plot time series
plt.figure(figsize=(12, 5))
plt.plot(df['Time'], df['Magnitude'], marker='o', linestyle='-', color='gray', label='Event Magnitude')
for t in thresholds:
plt.axhline(t, linestyle='--', label=f'Threshold {t}')
plt.title("Synthetic Time-Series of Event Magnitudes")
plt.xlabel("Time")
plt.ylabel("Magnitude")
plt.legend()
plt.grid(True)
plt.show()
# Interactive widget
interact(multi_threshold_demo,
n=IntSlider(min=50, max=500, step=50, value=100, description='Sample Size'),
val_min=IntSlider(min=0, max=50, step=5, value=0, description='Min Value'),
val_max=IntSlider(min=50, max=150, step=10, value=100, description='Max Value'),
t1=FloatSlider(min=0, max=150, step=1, value=60, description='Threshold 1'),
t2=FloatSlider(min=0, max=150, step=1, value=80, description='Threshold 2'),
t3=FloatSlider(min=0, max=150, step=1, value=100, description='Threshold 3'));
๐ Using Theoretical Distributions in Risk and Probability Analysis#
When empirical data is scarce, unreliable, or incomplete, we turn to theoretical probability distributions to model uncertainty and estimate risk. These distributions are based on mathematical assumptions and are widely used in engineering, environmental science, and decision analysis.
๐ Why Use Theoretical Distributions?#
Empirical distributions rely on observed data, which may be:
Too sparse (e.g., rare events like major earthquakes)
Incomplete (e.g., short monitoring periods)
Biased or noisy
Theoretical distributions provide:
A structured way to model uncertainty
Flexibility to simulate rare or extreme events
A foundation for probabilistic design and risk assessment
๐ Common Theoretical Distributions#
Distribution |
Typical Use Case |
Shape Characteristics |
---|---|---|
Normal |
Measurement errors, material properties |
Symmetric, bell-shaped |
Lognormal |
Soil permeability, rainfall intensity |
Skewed right, non-negative |
Exponential |
Time between failures, hazard occurrence |
Memoryless, decreasing tail |
Weibull |
Reliability, fatigue life |
Flexible shape for increasing/decreasing hazard |
Gumbel |
Extreme value modeling (e.g., floods, wind) |
Models maxima/minima of distributions |
Binomial |
Discrete event counts (e.g., success/failure) |
Finite trials, fixed probability |
Poisson |
Rare event frequency (e.g., landslides/year) |
Count-based, low probability |
โ๏ธ Application in Risk Assessment#
Theoretical distributions help estimate:
Probability of exceedance:
\(P(X > x)\) for a critical threshold \(x\)Return period:
\(\text{Return Period} = \frac{1}{P(\text{event})}\)Confidence intervals for uncertain parameters
Failure probabilities in reliability models
๐งช Example: Flood Risk Using Gumbel Distribution#
If historical flood data is limited, we can fit a Gumbel distribution to estimate:
Probability of a flood exceeding 500 mยณ/s
Expected return period of such an event
Design thresholds for levees or drainage systems
๐ง Key Takeaways#
Theoretical distributions are essential when empirical data is insufficient.
Choice of distribution depends on the nature of the variable (continuous vs. discrete, symmetric vs. skewed).
They enable simulation, forecasting, and probabilistic design under uncertainty.
Distribution fitting and goodness of fit#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from ipywidgets import interact, SelectMultiple, FloatSlider, IntSlider, Button, VBox, HBox, Output
# Output widget for dynamic display
out = Output()
# Available distributions
available_distributions = {
"Normal": stats.norm,
"Lognormal": stats.lognorm,
"Exponential": stats.expon,
"Weibull": stats.weibull_min
}
# Global data container
data = None
# Simulation function
def simulate_data(n=500, shape=2.0, scale=10.0):
global data
np.random.seed(42)
data = np.random.gamma(shape=shape, scale=scale, size=n)
# Fit and analyze selected distributions
def analyze(selected, threshold):
out.clear_output()
with out:
if data is None:
print("โ ๏ธ Please click 'Simulate' first.")
return
x = np.linspace(min(data), max(data), 100)
fit_results = []
# Empirical exceedance
exceed_emp = data > threshold
prob_emp = exceed_emp.sum() / len(data)
rp_emp = 1 / prob_emp if prob_emp > 0 else np.inf
print(f"๐ Empirical Probability of Exceeding {threshold:.2f}: {prob_emp:.4f}")
print(f"๐ Empirical Return Period: {rp_emp:.2f} events\n")
for name in selected:
dist = available_distributions[name]
params = dist.fit(data)
pdf_fitted = dist.pdf(x, *params)
ks_stat, ks_p = stats.kstest(data, dist.name, args=params)
log_likelihood = np.sum(dist.logpdf(data, *params))
k = len(params)
aic = 2 * k - 2 * log_likelihood
# Confidence intervals (approximate using bootstrap)
ci_lower = np.percentile(data, 2.5)
ci_upper = np.percentile(data, 97.5)
# Theoretical exceedance
prob_theo = dist.sf(threshold, *params)
rp_theo = 1 / prob_theo if prob_theo > 0 else np.inf
fit_results.append({
"Distribution": name,
"KS Statistic": round(ks_stat, 4),
"KS p-value": round(ks_p, 4),
"AIC": round(aic, 2),
"95% CI": f"[{ci_lower:.2f}, {ci_upper:.2f}]",
"P(X > threshold)": round(prob_theo, 4),
"Return Period": round(rp_theo, 2)
})
# Plot
plt.figure(figsize=(6, 4))
plt.hist(data, bins=30, density=True, alpha=0.5, label="Empirical")
plt.plot(x, pdf_fitted, label=f"{name} Fit", linewidth=2)
plt.axvline(threshold, color='red', linestyle='--', label=f"Threshold = {threshold}")
plt.title(f"{name} Distribution Fit")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend()
plt.grid(True)
plt.show()
# Summary table
df_results = pd.DataFrame(fit_results)
print("๐ Goodness-of-Fit and Risk Summary:")
display(df_results.sort_values("AIC"))
# Widgets
simulate_btn = Button(description="Simulate", button_style='success')
simulate_btn.on_click(lambda b: simulate_data())
distribution_selector = SelectMultiple(
options=list(available_distributions.keys()),
value=["Normal", "Lognormal"],
description="Distributions"
)
threshold_slider = FloatSlider(
value=80,
min=0,
max=200,
step=1,
description="Threshold"
)
analyze_btn = Button(description="Analyze", button_style='info')
analyze_btn.on_click(lambda b: analyze(distribution_selector.value, threshold_slider.value))
# Layout
VBox([
HBox([simulate_btn, analyze_btn]),
distribution_selector,
threshold_slider,
out
])
๐งช Hypothesis Testing: Concepts, Types, and Applications#
Hypothesis testing is a statistical method used to make decisions or inferences about population parameters based on sample data.
๐ฏ Core Concepts#
Null Hypothesis (\(H_0\)): Assumes no effect or no difference.
Alternative Hypothesis (\(H_1\) or \(H_a\)): Assumes a real effect or difference exists.
Test Statistic: A value computed from sample data used to decide whether to reject \(H_0\).
p-value: Probability of observing the test statistic under \(H_0\).
Significance Level (\(\alpha\)): Threshold for rejecting \(H_0\), commonly 0.05.
Confidence Interval: Range of values likely to contain the true parameter.
๐ Types of Hypothesis Tests#
Test Type |
Use Case |
Assumptions |
Example Scenario |
---|---|---|---|
Z-test |
Compare means (large \(n\), known \(\sigma\)) |
Normal distribution, known variance |
Mean weight of manufactured parts |
t-test |
Compare means (small \(n\), unknown \(\sigma\)) |
Normality, equal variance (for 2-sample) |
Drug effectiveness between two groups |
Chi-square test |
Test for independence or goodness-of-fit |
Categorical data, expected counts โฅ 5 |
Survey response vs. age group |
ANOVA |
Compare means across โฅ3 groups |
Normality, equal variance |
Crop yield across different fertilizers |
MannโWhitney U |
Non-parametric test for two independent groups |
No assumption of normality |
Median income between regions |
Wilcoxon Signed-Rank |
Non-parametric test for paired samples |
Symmetric distribution of differences |
Before/after treatment scores |
KolmogorovโSmirnov |
Compare distributions or test normality |
Continuous data |
Fit of theoretical distribution |
Log-rank test |
Compare survival curves |
Censored data, proportional hazards |
Time-to-failure in reliability studies |
๐ Decision Process#
Formulate hypotheses: \(H_0\) and \(H_1\)
Choose significance level: \(\alpha = 0.05\)
Select appropriate test based on data type and assumptions
Compute test statistic and p-value
Compare p-value to \(\alpha\):
If \(p < \alpha\): Reject \(H_0\)
If \(p \geq \alpha\): Fail to reject \(H_0\)
๐ง Applications in Engineering and Science#
Quality control: Z-test, t-test for process monitoring
Environmental analysis: ANOVA for comparing pollutant levels
Reliability: Log-rank test for component lifetimes
Geotechnical studies: Non-parametric tests for soil properties
Remote sensing: Chi-square for classification accuracy
Would you like to extend this with interactive examples or code snippets for each test?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind
from ipywidgets import interact, IntSlider, FloatSlider, Button, Output, VBox, HBox
# Output widget
out = Output()
# Global data
group1 = None
group2 = None
# Simulate data
def simulate_ttest(n=30, mu1=50, mu2=55, sigma=10):
global group1, group2
np.random.seed(42)
group1 = np.random.normal(loc=mu1, scale=sigma, size=n)
group2 = np.random.normal(loc=mu2, scale=sigma, size=n)
# Run t-test and plot
def run_ttest():
out.clear_output()
with out:
if group1 is None or group2 is None:
print("โ ๏ธ Please simulate data first.")
return
# Perform t-test
stat, p = ttest_ind(group1, group2)
mean1, mean2 = np.mean(group1), np.mean(group2)
# Plot
plt.figure(figsize=(6, 4))
plt.hist(group1, bins=20, alpha=0.6, label="Group 1", color='blue')
plt.hist(group2, bins=20, alpha=0.6, label="Group 2", color='orange')
plt.axvline(mean1, color='blue', linestyle='--', label=f"Mean 1 = {mean1:.2f}")
plt.axvline(mean2, color='orange', linestyle='--', label=f"Mean 2 = {mean2:.2f}")
plt.title("Distribution of Two Groups")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.legend()
plt.grid(True)
plt.show()
# Summary
print("๐ Two-Sample t-Test Results:")
print(f"t-statistic: {stat:.4f}")
print(f"p-value: {p:.4f}")
if p < 0.05:
print("โ
Significant difference (reject Hโ)")
else:
print("โ No significant difference (fail to reject Hโ)")
# Widgets
simulate_btn = Button(description="Simulate", button_style='success')
simulate_btn.on_click(lambda b: simulate_ttest(n_slider.value, mu1_slider.value, mu2_slider.value))
run_btn = Button(description="Run t-Test", button_style='info')
run_btn.on_click(lambda b: run_ttest())
n_slider = IntSlider(value=30, min=10, max=100, step=5, description="Sample Size")
mu1_slider = FloatSlider(value=50, min=30, max=70, step=1, description="Mean Group 1")
mu2_slider = FloatSlider(value=55, min=30, max=70, step=1, description="Mean Group 2")
# Layout
VBox([
HBox([simulate_btn, run_btn]),
n_slider,
mu1_slider,
mu2_slider,
out
])
Hypothesis Testing#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from ipywidgets import Dropdown, IntSlider, FloatSlider, Button, Output, VBox, HBox
# Output widget
out = Output()
# Global data containers
data_store = {}
# Test options
test_options = [
"Z-test",
"t-test",
"Chi-square test",
"ANOVA",
"MannโWhitney U",
"Wilcoxon Signed-Rank",
"KolmogorovโSmirnov",
"Log-rank test (simulated)"
]
# Simulate data
def simulate_data(n, mu1, mu2):
np.random.seed(42)
data_store.clear()
data_store["x1"] = np.random.normal(mu1, 10, n)
data_store["x2"] = np.random.normal(mu2, 10, n)
data_store["before"] = np.random.normal(mu1, 10, n)
data_store["after"] = data_store["before"] + np.random.normal(mu2 - mu1, 5, n)
data_store["anova"] = [
np.random.normal(50, 10, n),
np.random.normal(55, 10, n),
np.random.normal(60, 10, n)
]
data_store["survival1"] = np.random.exponential(scale=10, size=n)
data_store["survival2"] = np.random.exponential(scale=15, size=n)
# Analyze selected test
def analyze_test(test_type):
out.clear_output()
with out:
if not data_store:
print("โ ๏ธ Please click 'Simulate' first.")
return
def interpret(p, alpha=0.05):
if p < alpha:
return f"โ
Result is statistically significant (p < {alpha}). Reject the null hypothesis."
else:
return f"โ Result is not statistically significant (p โฅ {alpha}). Fail to reject the null hypothesis."
if test_type == "Z-test":
mu = 10
x1, x2 = data_store["x1"], data_store["x2"]
z = (np.mean(x1) - np.mean(x2)) / np.sqrt(mu**2/len(x1) + mu**2/len(x2))
p = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"Z-test: z = {z:.4f}, p = {p:.4f}")
print(interpret(p))
elif test_type == "t-test":
stat, p = stats.ttest_ind(data_store["x1"], data_store["x2"])
print(f"t-test: t = {stat:.4f}, p = {p:.4f}")
print(interpret(p))
elif test_type == "Chi-square test":
observed = np.array([[20, 30], [25, 25]])
stat, p, _, _ = stats.chi2_contingency(observed)
print(f"Chi-square test: ฯยฒ = {stat:.4f}, p = {p:.4f}")
print(interpret(p))
elif test_type == "ANOVA":
stat, p = stats.f_oneway(*data_store["anova"])
print(f"ANOVA: F = {stat:.4f}, p = {p:.4f}")
print(interpret(p))
elif test_type == "MannโWhitney U":
stat, p = stats.mannwhitneyu(data_store["x1"], data_store["x2"])
print(f"MannโWhitney U: U = {stat:.4f}, p = {p:.4f}")
print(interpret(p))
elif test_type == "Wilcoxon Signed-Rank":
stat, p = stats.wilcoxon(data_store["before"], data_store["after"])
print(f"Wilcoxon Signed-Rank: W = {stat:.4f}, p = {p:.4f}")
print(interpret(p))
elif test_type == "KolmogorovโSmirnov":
x = data_store["x1"]
stat, p = stats.kstest(x, 'norm', args=(np.mean(x), np.std(x)))
print(f"KolmogorovโSmirnov: D = {stat:.4f}, p = {p:.4f}")
print(interpret(p))
elif test_type == "Log-rank test (simulated)":
stat, p = stats.ttest_ind(data_store["survival1"], data_store["survival2"])
print(f"Log-rank (simulated): t = {stat:.4f}, p = {p:.4f}")
print(interpret(p))
# Optional plot
plt.figure(figsize=(6, 4))
if test_type in ["Z-test", "t-test", "MannโWhitney U"]:
plt.hist(data_store["x1"], bins=20, alpha=0.6, label="Group 1")
plt.hist(data_store["x2"], bins=20, alpha=0.6, label="Group 2")
elif test_type == "Wilcoxon Signed-Rank":
plt.plot(data_store["before"], label="Before", marker='o')
plt.plot(data_store["after"], label="After", marker='x')
elif test_type == "ANOVA":
plt.boxplot(data_store["anova"], labels=["Group 1", "Group 2", "Group 3"])
elif test_type == "KolmogorovโSmirnov":
x = data_store["x1"]
plt.hist(x, bins=20, alpha=0.7, label="Sample")
plt.plot(np.sort(x), stats.norm.pdf(np.sort(x), np.mean(x), np.std(x)), label="Normal PDF")
elif test_type == "Log-rank test (simulated)":
plt.hist(data_store["survival1"], bins=20, alpha=0.6, label="Group 1")
plt.hist(data_store["survival2"], bins=20, alpha=0.6, label="Group 2")
plt.title(f"{test_type} Visualization")
plt.legend()
plt.grid(True)
plt.show()
# Widgets
test_selector = Dropdown(options=test_options, value="t-test", description="Test Type")
n_slider = IntSlider(value=30, min=10, max=100, step=5, description="Sample Size")
mu1_slider = FloatSlider(value=50, min=30, max=70, step=1, description="Mean Group 1")
mu2_slider = FloatSlider(value=55, min=30, max=70, step=1, description="Mean Group 2")
simulate_btn = Button(description="Simulate", button_style='success')
simulate_btn.on_click(lambda b: simulate_data(n_slider.value, mu1_slider.value, mu2_slider.value))
analyze_btn = Button(description="Analyze", button_style='info')
analyze_btn.on_click(lambda b: analyze_test(test_selector.value))
# Layout
VBox([
HBox([simulate_btn, analyze_btn]),
test_selector,
n_slider,
mu1_slider,
mu2_slider,
out
])