Chapter 2 Hydrology: Flood Frequency#
1. Introduction#
🌊 Flood Frequency Analysis#
Flood Frequency Analysis (FFA) is a statistical method used to estimate the probability of different magnitudes of flood events occurring at a specific location over time [Cunnane, 1989, Holland et al., 2016, Jr. et al., 2018].
🔍 Key Concepts#
Annual Maximum Series (AMS): Uses the highest flow recorded each year.
Return Period (T): Average interval between floods of a given magnitude.
Exceedance Probability (p):
$p = \frac{1}{T}$, where $T$ is the return period.
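For example, the 100-year flood has an annual exceedance probability of $1/100 = 0.01$. Below is a minimal sketch of this relation, together with the standard risk of at least one exceedance over an $n$-year design life (which assumes independent annual maxima; the design life chosen here is illustrative):

```python
# Annual exceedance probability p = 1/T, and the probability of at least
# one exceedance during an n-year design life: 1 - (1 - p)**n
n = 30  # illustrative design life in years
for T in [2, 10, 50, 100]:
    p = 1 / T
    risk = 1 - (1 - p) ** n
    print(f"T = {T:>3} yr   p = {p:.3f}   P(>=1 exceedance in {n} yr) = {risk:.2f}")
```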
📊 Importance of Observational Data#
Observational data are crucial for conducting reliable flood frequency analyses.
✅ Benefits:#
Long-term records improve statistical reliability.
Site-specific data reflect local hydrologic behavior.
Empirical distributions can be fitted directly.
⚠️ Limitations with Sparse Data:#
High uncertainty in flood quantile estimates.
Poor representation of rare, extreme events.
Skewness and variance may be biased.
📐 Theoretical Distributions for Extreme Events#
When data are limited, theoretical distributions help extrapolate flood probabilities.
🧠 Common Distributions#
| Distribution | Type | Use Case | Notes |
|---|---|---|---|
| Gumbel (EV1) | Extreme Value | AMS floods | Assumes exponential tail |
| Log-Pearson Type III | Skewed | US standard (Bulletin 17C) | Handles skewness well |
| Generalized Extreme Value (GEV) | Flexible | Global use | Includes Gumbel, Fréchet, Weibull |
| Weibull | Empirical | Plotting positions | Used for ranking floods |
📊 Empirical vs. Theoretical Distributions in Extreme Value Analysis#
🔍 Empirical Distribution#
Definition: Based directly on observed data (e.g., annual peak discharges).
Construction: Uses ranking and plotting positions (e.g., the Weibull formula) to estimate exceedance probabilities; a worked sketch follows this list.
Strengths:
Reflects site-specific behavior.
No assumptions about underlying distribution.
Limitations:
Requires long-term data for reliability.
Poor extrapolation for rare/extreme events.
Sensitive to outliers and sampling variability.
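To make the construction concrete, here is a minimal sketch of the Weibull plotting-position formula $p_i = \frac{i}{n+1}$ applied to ranked annual peaks (the sample values are illustrative):

```python
import numpy as np

# Illustrative annual peak flows (cfs)
peaks = np.array([780, 520, 846, 440, 707, 1870, 987, 2760])

# Rank the peaks from largest to smallest; the Weibull plotting position
# p_i = i / (n + 1) is the empirical exceedance probability of rank i
sorted_desc = np.sort(peaks)[::-1]
n = len(sorted_desc)
ranks = np.arange(1, n + 1)
p_exceed = ranks / (n + 1)   # empirical exceedance probability
T_emp = 1 / p_exceed         # corresponding empirical return period

for q, p, T in zip(sorted_desc, p_exceed, T_emp):
    print(f"Q = {q:6.0f} cfs   p = {p:.3f}   T = {T:4.1f} yr")
```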
📐 Theoretical Distribution#
Definition: Assumes data follow a known probability distribution (e.g., Gumbel, Log-Pearson III, GEV).
Construction: Fits a parametric model to data using methods like Maximum Likelihood Estimation (MLE).
Strengths:
Enables extrapolation beyond observed range.
Provides analytical expressions for return periods and quantiles.
Limitations:
Requires assumptions about distribution type.
Sensitive to parameter estimation, especially with sparse data.
🌊 Why Use Theoretical Distributions for Extreme Events?#
Extreme events (e.g., 100-year floods) are rare and often absent from short observational records. Theoretical distributions allow us to estimate their magnitude and frequency by modeling the tail behavior of the data.
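For example, here is a minimal sketch (the annual maxima are illustrative) of fitting a GEV by maximum likelihood with scipy and extrapolating to the 100-year quantile:

```python
import numpy as np
from scipy import stats

# Illustrative annual maximum flows (cfs); a real record would be longer
ams = np.array([780, 520, 846, 440, 707, 763, 514, 602, 833, 690,
                1870, 545, 702, 987, 816, 900, 1160, 800, 1460, 2760])

# Fit the GEV by maximum likelihood, then evaluate the 1% exceedance
# (100-year) quantile from the fitted tail
shape, loc, scale = stats.genextreme.fit(ams)
q100 = stats.genextreme.ppf(1 - 0.01, shape, loc, scale)

print(f"GEV shape={shape:.3f}, loc={loc:.1f}, scale={scale:.1f}")
print(f"Estimated 100-year flood: {q100:,.0f} cfs")
```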
📈 Most Popular Distributions for Extremes#
| Distribution | Type | Notes |
|---|---|---|
| Generalized Extreme Value (GEV) | Block maxima | Unifies Gumbel, Fréchet, Weibull; widely used globally |
| Log-Pearson Type III | Skewed | Official standard in the U.S. (Bulletin 17C) |
| Gumbel (EV1) | Light-tailed | Historically common for flood peaks |
🌐 Regional Flood Frequency Analysis (RFFA)#
Used when site-specific data are sparse or unavailable.
🗺️ Key Features:#
Homogeneous Region: Group sites with similar hydrologic characteristics (climate, physiography, land use).
Index Flood Method: Normalize peak flows by a regional index (e.g., the mean annual flood); a sketch follows this list.
Regional Regression: Use basin attributes to estimate flood quantiles.
L-Moments, GAMLSS, or Bayesian Methods: Improve parameter estimation across sites.
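A minimal sketch of the index-flood normalization, assuming the mean annual flood as the index (all site records are illustrative):

```python
import numpy as np

# Illustrative annual peak records (cfs) for three sites in one
# homogeneous region
sites = {
    "site_A": np.array([320, 410, 560, 290, 480]),
    "site_B": np.array([900, 1200, 1500, 800, 1100]),
    "site_C": np.array([150, 210, 260, 180, 230]),
}

# Dividing each record by its index flood (here, the site mean) puts all
# sites on a common dimensionless scale so they can be pooled regionally
pooled = np.concatenate([peaks / peaks.mean() for peaks in sites.values()])
print("Pooled growth-curve sample (Q / mean annual flood):")
print(np.round(np.sort(pooled), 2))
```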
🌍 Most Widely Used:#
United States: Log-Pearson Type III (per USGS Bulletin 17C)
Globally: Generalized Extreme Value (GEV)
🧩 Hybrid Approaches#
When observational data are sparse:
Bayesian Estimation: Combines prior knowledge with observed data.
Regional Skew Adjustment: Blends site-specific and regional skew.
Expected Moments Algorithm (EMA): Handles censored and interval data.
📘 USGS Practice:#
Uses regional skewness maps and regression equations.
Bulletin 17C recommends combining at-site and regional skew.
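As a sketch of the skew-blending idea, Bulletin 17B/17C-style guidance weights the at-site and regional skew estimates inversely by their mean square errors (all numbers below are illustrative, and the actual guidelines include further adjustments):

```python
# MSE-weighted skew: the estimate with the smaller mean square error
# receives the larger weight. All values are illustrative.
G_site, mse_site = 0.45, 0.30        # at-site skew and its MSE
G_region, mse_region = 0.20, 0.10    # regional skew and its MSE

G_weighted = (mse_region * G_site + mse_site * G_region) / (mse_site + mse_region)
print(f"Weighted skew: {G_weighted:.3f}")  # falls between the two estimates
```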
🧠 Summary#
Flood frequency analysis is essential for infrastructure design and risk management. While observational data are ideal, theoretical distributions—especially GEV and Log-Pearson Type III—enable extrapolation when data are sparse. Regional methods fill gaps for ungauged sites, ensuring robust flood risk estimates.
Foundational Literature#
[Cunnane, 1989, Holland et al., 2016, Jr. et al., 2018] provide the foundational knowledge for extreme flood frequency analysis. These works collectively establish the statistical frameworks, methodological advances, and practical guidelines for estimating the probability of flood events of varying magnitudes at specific locations over time. [Cunnane, 1989] offers a comprehensive review of the statistical distributions used in FFA, including criteria for distribution selection, and serves as a global reference for hydrologists conducting design flood estimation. [Holland et al., 2016] evaluates the relevance of at-site FFA methods for extreme events using a simulation-based approach to improve the reliability of extreme flood probability estimates. [Jr. et al., 2018], an update to Bulletin 17B, addresses the challenges of estimating rare floods, supports risk-informed design, and represents the U.S. national standard for flood frequency estimation. Together, these sources form the core methodological and practical basis for assessing extreme flood risk and designing infrastructure.
Flood frequency analysis tool#
This tool reads peak flow data from the USGS NWIS database and fits 10 commonly used extreme value probability distributions to estimate flood magnitudes associated with various return periods (e.g., 2-year, 100-year). It performs statistical goodness-of-fit evaluation and provides an interactive interface to visualize the flood frequency curve for each distribution.
What the Tool Does#
✅ Reads annual peak discharge data from a NWIS .txt file
✅ Fits multiple statistical distributions to the observed peak flows
✅ Computes estimated flood quantiles for specific return periods (2, 5, 10, 25, 50, 100 years)
✅ Calculates RMSE and Kolmogorov–Smirnov (KS) goodness-of-fit metrics
✅ Allows the user to interactively select a distribution and view:
Estimated peak flows
Distribution parameters
GOF statistics
A flood frequency curve plotted in log scale
How to Use#
Prepare Input File
Download annual peak streamflow data from the USGS NWIS Peak Flow site
Save it as a tab-delimited .txt file (e.g., 07022500_nwis_peak.txt)
Run the Script in Jupyter Notebook
Place the file in your working directory
Modify the line `usgs_file = "07022500_nwis_peak.txt"` to match your filename
Run the script cell-by-cell
Explore Results
View the summary table of fitted distribution parameters and their statistical performance
Use the dropdown selector to compare estimated flood flows and curves for each distribution
Theoretical Background: Distributions Used#
Each distribution estimates the probability of rare flood events based on historical data. Here’s a quick reference:
| Distribution | Description | Parameters |
|---|---|---|
| Gumbel (EV1) | Models block maxima (e.g., annual max). Skewed right. | Location (μ), Scale (β) |
| Log-Pearson III | Log-transformed Pearson Type III. Used in U.S. federal flood studies. | Shape (α), Location (μ), Scale |
| GEV | General form for extremes. Includes Gumbel, Fréchet, Weibull as cases. | Shape (ξ), Location, Scale |
| Normal | Symmetric bell curve. May misrepresent skewed flood data. | Mean (μ), Std. dev. (σ) |
| Lognormal | Data are normally distributed after the log transform. Skewed right. | Shape (σ), Location, Scale |
| Weibull (Type III) | Useful for extreme minima or upper tails. | Shape (k), Location, Scale |
| Exponential | Special case of Weibull; constant failure rate (rarely used for floods). | Rate (λ) or Scale |
| Gamma | General skewed distribution; flexible fit for hydrology. | Shape (k), Scale (θ), Location |
| Loglogistic (Fisk) | Skewed right, like lognormal but heavier-tailed. | Shape (c), Location, Scale |
| Generalized Pareto | Models excesses over a threshold (POT approach). | Shape, Location, Scale |
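The tool below fits all of these to annual maxima; for the Generalized Pareto row specifically, a peaks-over-threshold (POT) analysis instead fits the excesses over a chosen threshold. A minimal sketch (the threshold and flows are illustrative; a real POT analysis would also decluster the series so exceedances are independent):

```python
import numpy as np
from scipy import stats

# Illustrative peak flows (cfs)
flows = np.array([420, 980, 310, 1460, 760, 2760, 640, 1160, 870, 1870,
                  530, 990, 450, 1080, 700, 816, 900, 975, 1210, 1490])

threshold = 800                              # assumed threshold (illustrative)
excesses = flows[flows > threshold] - threshold

# Fit a Generalized Pareto to the excesses, fixing the location at zero
shape, loc, scale = stats.genpareto.fit(excesses, floc=0)
print(f"GPD shape={shape:.3f}, scale={scale:.1f} "
      f"({len(excesses)} exceedances above {threshold} cfs)")
```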
Performance Evaluation Criteria#
Two statistical metrics assess how well each distribution fits the observed data:
🔹 Root Mean Squared Error (RMSE)
Measures the average error between observed peak flows and quantiles estimated from the distribution:

$$\text{RMSE} = \sqrt{ \frac{1}{n} \sum (Q_{\text{obs}} - Q_{\text{est}})^2 }$$

Lower values indicate a better fit.
🔹 Kolmogorov–Smirnov (KS) Statistic
Measures the maximum difference between the empirical cumulative distribution function (ECDF) and the theoretical CDF:

$$D = \sup_x |F_n(x) - F(x)|$$
The test returns both the KS statistic and a p-value.
If the p-value > 0.05, the fit is not rejected at the 5% significance level (✅ Pass).
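For intuition, here is a minimal sketch (sample values are illustrative) that computes $D$ directly from the ECDF and checks it against scipy's `kstest`:

```python
import numpy as np
from scipy import stats

# Illustrative annual peaks (cfs), sorted ascending
x = np.sort(np.array([440, 520, 602, 707, 780, 846, 987, 1160, 1460, 2760]))
n = len(x)

# Fit a candidate distribution and evaluate its CDF at the ordered data
params = stats.gumbel_r.fit(x)
F = stats.gumbel_r.cdf(x, *params)

# The ECDF steps from (i-1)/n to i/n at each ordered point; the supremum
# difference is attained at one of these step edges
ecdf_hi = np.arange(1, n + 1) / n
ecdf_lo = np.arange(0, n) / n
D = max(np.max(ecdf_hi - F), np.max(F - ecdf_lo))

print(f"manual D = {D:.3f}")
print(f"scipy  D = {stats.kstest(x, stats.gumbel_r.cdf, args=params).statistic:.3f}")
```

One caveat worth keeping in mind: when the distribution's parameters are estimated from the same sample being tested, the standard KS p-value is only approximate and tends to be optimistic.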
Output Summary#
A sorted summary table of all distributions including:
Fitted parameters
RMSE
KS statistic and p-value
Pass/fail interpretation
Interactive flood frequency plots for return periods on a log-x axis
Ability to choose which distribution best represents the dataset
Applications#
Floodplain mapping
Hydraulic structure design (culverts, bridges, dams)
Return period–based risk estimation
Hydrologic modeling calibration
Possible extensions include confidence intervals, percentile shading, and exported reports in Excel or PDF.
2. Simulation#
import pandas as pd
from io import StringIO
# Simulate the file content as a string
data = """\
agency_cd site_no peak_dt peak_va
USGS 7022500 3/3/1953 780
USGS 7022500 1/20/1954 520
USGS 7022500 3/20/1955 846
USGS 7022500 2/2/1956 440
USGS 7022500 1/22/1957 707
USGS 7022500 11/18/1957 763
USGS 7022500 8/6/1959 514
USGS 7022500 7/3/1960 602
USGS 7022500 3/12/1961 304
USGS 7022500 9/16/1962 833
USGS 7022500 3/4/1963 228
USGS 7022500 3/4/1964 690
USGS 7022500 3/29/1965 1870
USGS 7022500 3/20/1968 545
USGS 7022500 4/9/1969 702
USGS 7022500 6/13/1970 295
USGS 7022500 8/21/1971 987
USGS 7022500 7/28/1972 350
USGS 7022500 5/1/1973 816
USGS 7022500 11/24/1973 900
USGS 7022500 3/29/1975 1160
USGS 7022500 2/17/1976 800
USGS 7022500 6/26/1977 748
USGS 7022500 3/14/1978 728
USGS 7022500 12/3/1978 1460
USGS 7022500 3/17/1980 364
USGS 7022500 6/6/1981 975
USGS 7022500 1/4/1982 1080
USGS 7022500 6/3/1983 2760
USGS 7022500 4/29/1984 954
USGS 7022500 9/5/1985 660
USGS 7022500 5/24/1986 360
USGS 7022500 2/28/1987 415
"""
# Parse the simulated text into a DataFrame. The string is whitespace-
# separated, so sep="\t" would collapse every row into a single column
# and later raise KeyError: 'peak_va'; sep=r"\s+" splits on any run of
# whitespace and parses the four columns correctly.
df = pd.read_csv(StringIO(data), sep=r"\s+")
df.columns = df.columns.str.strip()

# Preview the first few rows and the annual peak discharge column
display(df.head())
print(df['peak_va'])
### Required Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import kstest
from sklearn.metrics import mean_squared_error
from ipywidgets import interact, Dropdown
from IPython.display import display
#pip install scipy
### Load NWIS Peak Flow Data
# def read_nwis_peak_file(file_path):
# try:
# with open(file_path, 'r') as f:
# lines = f.readlines()
# start_line = next(i for i, line in enumerate(lines) if not line.startswith('#'))
# df = pd.read_csv(
# file_path,
# sep='\t',
# comment='#',
# header=0,
# dtype=str,
# engine='python'
# )
# df.columns = df.columns.str.strip()
# df['peak_dt'] = pd.to_datetime(df['peak_dt'], errors='coerce')
# df['peak_va'] = pd.to_numeric(df['peak_va'], errors='coerce')
# df_clean = df[['site_no', 'peak_dt', 'peak_va']].dropna()
# return df_clean
# except Exception as e:
# print(f"❌ Error reading file: {e}")
# return pd.DataFrame()
### Load & Preview Data
# usgs_file = "07022500_nwis_peak.txt"
# peak_df = read_nwis_peak_file(usgs_file)
# display(peak_df.head())
### Set Up Distribution Parameters
distributions = {
"Gumbel (EV1)": stats.gumbel_r,
"Log-Pearson III": stats.pearson3,
"GEV": stats.genextreme,
"Normal": stats.norm,
"Lognormal": stats.lognorm,
"Weibull": stats.weibull_min,
"Exponential": stats.expon,
"Gamma": stats.gamma,
"Loglogistic": stats.fisk,
"Generalized Pareto": stats.genpareto
}
### Define Probability Array
# Use the simulated data; with a real NWIS file, use peak_df['peak_va'] instead
peak_values = df['peak_va']
sorted_data = np.sort(peak_values)

# Non-exceedance probabilities and the corresponding return periods T = 1/(1-F)
prob_plot = np.linspace(0.01, 0.99, 100)
return_periods = 1 / (1 - prob_plot)
prob_exceed = 0.01  # exceedance probability of the 100-year flood
### Fit Distributions
summary_rows = []
fit_results = {}
for name, dist in distributions.items():
    try:
        # Fit parameters by maximum likelihood
        params = dist.fit(peak_values)
        # 100-year flood estimate (1% annual exceedance probability)
        flood_q = dist.ppf(1 - prob_exceed, *params)
        # Evaluate quantiles at as many plotting positions as there are
        # observations, so RMSE compares like-sized arrays
        prob_plot = np.linspace(0.01, 0.99, len(sorted_data))
        q_estimates = dist.ppf(prob_plot, *params)
        rmse = np.sqrt(mean_squared_error(sorted_data, q_estimates))
        ks_stat, ks_pval = kstest(peak_values, dist.cdf, args=params)
fit_results[name] = {
"params": params,
"q": dist.ppf(1 - 1 / return_periods, *params),
"rmse": rmse,
"ks_stat": ks_stat,
"ks_pval": ks_pval
}
param_str = ", ".join([f"{p:.2f}" for p in params])
summary_rows.append({
"Distribution": name,
"Parameters": param_str,
"RMSE (cfs)": round(rmse, 2),
"KS Stat": round(ks_stat, 3),
"KS p-value": round(ks_pval, 3),
"KS Result": "✅ Pass" if ks_pval > 0.05 else "❌ Reject"
})
except Exception as e:
print(f"⚠️ Could not fit {name}: {e}")
### Summary Table
summary_df = pd.DataFrame(summary_rows).sort_values(by="RMSE (cfs)")
print("\n📊 Goodness-of-Fit Summary for All Distributions:\n")
display(summary_df)
### Interactive Plotting
def plot_selected_distribution(dist_name):
result = fit_results[dist_name]
q = result["q"]
params = result["params"]
param_str = ", ".join([f"{p:.2f}" for p in params])
plt.figure(figsize=(8, 5))
plt.plot(return_periods, q, marker='o', linestyle='-', color='royalblue', label="Estimated Peak Flow")
plt.xscale('log')
plt.xlabel("Return Period (years, log scale)")
plt.ylabel("Estimated Peak Flow (cfs)")
plt.title(f"{dist_name} Flood Frequency Curve\nParameters: {param_str}")
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend()
plt.tight_layout()
plt.show()
# Tabular Output
df_plot = pd.DataFrame({
"Return Period (yr)": return_periods.round(1),
"Estimated Peak Flow (cfs)": q.round(2)
})
print(f"\n📌 Parameters: {param_str}")
print(f"RMSE: {result['rmse']:.2f}, KS stat: {result['ks_stat']:.3f}, p-value: {result['ks_pval']:.3f}")
display(df_plot)
### Launch Widget
interact(plot_selected_distribution, dist_name=Dropdown(options=list(fit_results.keys()), description="Distribution"))
3. Self-Assessment#
Self-Assessment: Flood Frequency Analysis Tool#
Use these prompts and questions to evaluate your understanding of the tool and its underlying hydrologic and statistical concepts.
Conceptual Questions#
Why are return periods plotted on a logarithmic scale in flood frequency analysis?
Hint: Think about how frequent vs. rare events are distributed.
What is the purpose of fitting multiple distributions to the same peak flow dataset?
Hint: No single distribution fits all scenarios equally well.
How do Gringorten plotting positions help in flood frequency analysis?
Hint: They’re used to assign empirical probabilities to ordered data.
What assumptions underlie the use of the Gumbel distribution in hydrology?
Hint: It’s designed to model block maxima like annual peak flows.
How do parametric and non-parametric flood frequency methods differ in their approach?
Hint: Consider how the data distribution is treated.
Reflective Prompts#
If two distributions yield similar RMSE but different KS p-values, which metric is more important for selecting a model—and why?
Can a statistically good-fitting distribution be inappropriate for design applications? Provide an example.
How would you adapt this tool to process data from multiple gage stations simultaneously?
What limitations might this tool face when applied to future climate-affected streamflow patterns?
How would the analysis change if you used partial-duration series instead of annual maxima?
Quiz Questions#
Q1. The Gumbel distribution is commonly used to model:
A. Rainfall intensity
B. Annual maximum values
C. Median flow durations
D. Baseflow during drought
✅ Correct: B
Q2. The Kolmogorov–Smirnov test compares:
A. Log and normal distributions
B. ECDF and theoretical CDF
C. Mean annual rainfall
D. Number of peaks above threshold
✅ Correct: B
Q3. In the Generalized Extreme Value distribution, the shape parameter controls:
A. Peak discharge
B. Tail behavior
C. Cumulative runoff
D. Frequency of low flows
✅ Correct: B
Q4. A high KS p-value and low RMSE suggest:
A. Overfitting
B. Good model fit
C. Poor data resolution
D. Statistical bias
✅ Correct: B
Q5. Which distribution is least appropriate for positively skewed hydrologic data?
A. Gumbel
B. Lognormal
C. Normal
D. Log-Pearson III
✅ Correct: C