Chapter 2 Hydrology: Flood Frequency#

  1. Introduction: Flood frequency

  2. Simulation: EVD performance

  3. Self-Assessment

1. Introduction#

🌊 Flood Frequency Analysis#

Flood Frequency Analysis (FFA) is a statistical method used to estimate the probability of different magnitudes of flood events occurring at a specific location over time [Cunnane, 1989, Holland et al., 2016, Jr. et al., 2018].

🔍 Key Concepts#

  • Annual Maximum Series (AMS): Uses the highest flow recorded each year.

  • Return Period (T): Average interval between floods of a given magnitude.

  • Exceedance Probability (p):
    $p = \frac{1}{T}$, where $T$ is the return period (a short numeric sketch follows this list).
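
To make these quantities concrete, here is a minimal numeric sketch linking return period, exceedance probability, and the chance of at least one exceedance over a project's design life, $R = 1 - (1 - p)^n$ (the 30-year design life is an assumed illustration):

# Exceedance probability and design-life risk (illustrative values)
T = 100                     # return period in years
p = 1 / T                   # annual exceedance probability = 0.01
n = 30                      # assumed design life in years
risk = 1 - (1 - p) ** n     # chance of at least one exceedance in n years
print(f"p = {p:.2%}, risk over {n} years = {risk:.1%}")  # ~26%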


📊 Importance of Observational Data#

Observational data are crucial for conducting reliable flood frequency analyses.

✅ Benefits:#

  • Long-term records improve statistical reliability.

  • Site-specific data reflect local hydrologic behavior.

  • Empirical distributions can be fitted directly.

⚠️ Limitations with Sparse Data:#

  • High uncertainty in flood quantile estimates.

  • Poor representation of rare, extreme events.

  • Skewness and variance may be biased.


📐 Theoretical Distributions for Extreme Events#

When data are limited, theoretical distributions help extrapolate flood probabilities.

🧠 Common Distributions#

| Distribution | Type | Use Case | Notes |
|---|---|---|---|
| Gumbel (EV1) | Extreme Value | AMS floods | Assumes exponential tail |
| Log-Pearson Type III | Skewed | US standard (Bulletin 17C) | Handles skewness well |
| Generalized Extreme Value (GEV) | Flexible | Global use | Includes Gumbel, Fréchet, Weibull |
| Weibull | Empirical | Plotting positions | Used for ranking floods |
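
As a quick check of the nesting noted in the GEV row, SciPy's genextreme reduces to the Gumbel distribution when its shape parameter is zero; a minimal sketch:

from scipy import stats

# GEV with shape c = 0 coincides with Gumbel (EV1) in SciPy's parameterization
x = 2.5
print(stats.genextreme.cdf(x, 0.0, loc=0, scale=1))  # GEV, shape 0
print(stats.gumbel_r.cdf(x, loc=0, scale=1))         # same value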

📊 Empirical vs. Theoretical Distributions in Extreme Value Analysis#

🔍 Empirical Distribution#

  • Definition: Based directly on observed data (e.g., annual peak discharges).

  • Construction: Uses ranking and plotting positions (e.g., the Weibull formula) to estimate exceedance probabilities (a short sketch follows this list).

  • Strengths:

    • Reflects site-specific behavior.

    • No assumptions about underlying distribution.

  • Limitations:

    • Requires long-term data for reliability.

    • Poor extrapolation for rare/extreme events.

    • Sensitive to outliers and sampling variability.
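
As noted above, the empirical curve comes straight from ranked observations. A minimal sketch using the Weibull plotting position $p_m = m/(n+1)$, applied here to the first ten peaks of the simulated record in Section 2:

import numpy as np

# First ten annual peaks (cfs) from the simulated record in Section 2
peaks = np.array([780, 520, 846, 440, 707, 763, 514, 602, 304, 833])

ranked = np.sort(peaks)[::-1]   # rank 1 = largest flood
n = len(ranked)
m = np.arange(1, n + 1)         # rank of each ordered value
p_exceed = m / (n + 1)          # Weibull plotting position
T_emp = 1 / p_exceed            # empirical return period (years)

for q, p, t in zip(ranked, p_exceed, T_emp):
    print(f"Q = {q:4.0f} cfs   p = {p:.3f}   T = {t:4.1f} yr")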


📐 Theoretical Distribution#

  • Definition: Assumes data follow a known probability distribution (e.g., Gumbel, Log-Pearson III, GEV).

  • Construction: Fits a parametric model to data using methods like Maximum Likelihood Estimation (MLE).

  • Strengths:

    • Enables extrapolation beyond observed range.

    • Provides analytical expressions for return periods and quantiles.

  • Limitations:

    • Requires assumptions about distribution type.

    • Sensitive to parameter estimation, especially with sparse data.


🌊 Why Use Theoretical Distributions for Extreme Events?#

Extreme events (e.g., 100-year floods) are rare and often absent from short observational records. Theoretical distributions allow us to estimate their magnitude and frequency by modeling the tail behavior of the data.
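
For example, a Gumbel fit to a short record gives a closed-form return level, $x_T = \mu - \beta \ln(-\ln(1 - 1/T))$, that can be evaluated well beyond the observed range. A minimal sketch (the 15 peaks are the early portion of the simulated record in Section 2):

import numpy as np
from scipy import stats

# Short AMS record (cfs): early portion of the simulated record in Section 2
peaks = np.array([780, 520, 846, 440, 707, 763, 514, 602, 304, 833,
                  228, 690, 1870, 545, 702])

mu, beta = stats.gumbel_r.fit(peaks)   # location and scale by MLE
for T in (10, 100, 500):
    xT = mu - beta * np.log(-np.log(1 - 1 / T))   # Gumbel return level
    print(f"{T:>3}-yr flood ≈ {xT:,.0f} cfs")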


🌐 Regional Flood Frequency Analysis (RFFA)#

Used when site-specific data are sparse or unavailable.

🗺️ Key Features:#

  • Homogeneous Region: Group sites with similar hydrologic characteristics (climate, physiography, land use).

  • Index Flood Method: Normalize data by a regional index (e.g., mean annual flood); a sketch follows this list.

  • Regional Regression: Use basin attributes to estimate flood quantiles.

  • L-Moments, GAMLSS, or Bayesian Methods: Improve parameter estimation across sites.
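
A minimal sketch of the index-flood idea, with made-up records for three hypothetical gauges assumed to form a homogeneous region, and an assumed index flood for an ungauged site (all values illustrative):

import numpy as np

# Hypothetical AMS records (cfs) for three gauges in an assumed homogeneous region
sites = {
    "A": np.array([410, 520, 380, 690, 455]),
    "B": np.array([980, 1210, 840, 1530, 1100]),
    "C": np.array([260, 310, 225, 400, 290]),
}

# Index flood = mean annual flood at each site; pool the normalized data
index = {s: q.mean() for s, q in sites.items()}
pooled = np.concatenate([q / index[s] for s, q in sites.items()])

# Regional growth factor, here the pooled 90th-percentile ratio
growth_90 = np.quantile(pooled, 0.90)

# Rescale to an ungauged site whose index flood is assumed to come from
# a regional regression on basin attributes
Q_index_ungauged = 600.0
print(f"Estimated 90th-percentile flood ≈ {growth_90 * Q_index_ungauged:,.0f} cfs")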


🌍 Most Widely Used:#

  • United States: Log-Pearson Type III (per USGS Bulletin 17C)

  • Globally: Generalized Extreme Value (GEV)


🧩 Hybrid Approaches#

When observational data are sparse:

  • Bayesian Estimation: Combines prior knowledge with observed data.

  • Regional Skew Adjustment: Blends site-specific and regional skew (a sketch follows this list).

  • Expected Moments Algorithm (EMA): Handles censored and interval data.
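
A minimal sketch of inverse-MSE skew weighting in the style of Bulletin 17B, blending the at-site and regional skews in proportion to their reliability (all numbers are illustrative assumptions, not published values):

# Inverse-MSE weighting of at-site and regional skew (Bulletin 17B style);
# every value below is an illustrative assumption
G_site, mse_site = 0.45, 0.30   # at-site skew and its mean squared error
G_reg, mse_reg = -0.10, 0.12    # regional skew and its MSE (e.g., from a skew map)

# Each estimate is weighted by the inverse of its MSE
G_weighted = (mse_reg * G_site + mse_site * G_reg) / (mse_site + mse_reg)
print(f"Weighted skew = {G_weighted:.3f}")  # pulled toward the more reliable estimate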



📘 USGS Practice:#

  • Uses regional skewness maps and regression equations.

  • Bulletin 17C recommends combining at-site and regional skew.


🧠 Summary#

Flood frequency analysis is essential for infrastructure design and risk management. While observational data are ideal, theoretical distributions—especially GEV and Log-Pearson Type III—enable extrapolation when data are sparse. Regional methods fill gaps for ungauged sites, ensuring robust flood risk estimates.

Foundational Literature#

Flood Frequency Analysis (FFA) is a statistical method for estimating the probability of flood events of varying magnitudes occurring at a specific location over time, and [Cunnane, 1989, Holland et al., 2016, Jr. et al., 2018] provide the foundational knowledge for extreme FFA. These works collectively establish the statistical frameworks, methodological advances, and practical guidelines that underpin the field. [Cunnane, 1989] gives a comprehensive review of the statistical distributions used in FFA, including criteria for distribution selection, and serves as a global reference for hydrologists conducting design flood estimation. [Holland et al., 2016] evaluates the reliability of at-site FFA methods for extreme events using a simulation-based approach. [Jr. et al., 2018], the update to Bulletin 17B, addresses the challenges of estimating rare floods, supports risk-informed design, and represents the U.S. national standard for flood frequency estimation. Together, these sources form the core methodological and practical basis for assessing extreme flood risk and designing infrastructure.

Flood frequency analysis tool#

This tool reads peak flow data from the USGS NWIS database and fits 10 commonly used probability distributions, including several extreme value families, to estimate flood magnitudes associated with various return periods (e.g., 2-year, 100-year). It performs statistical goodness-of-fit evaluation and provides an interactive interface for visualizing the flood frequency curve of each distribution.


What the Tool Does#

  • ✅ Reads annual peak discharge data from a NWIS .txt file

  • ✅ Fits multiple statistical distributions to the observed peak flows

  • ✅ Computes estimated flood quantiles for specific return periods (2, 5, 10, 25, 50, 100 years)

  • ✅ Calculates RMSE and Kolmogorov–Smirnov (KS) goodness-of-fit metrics

  • ✅ Allows the user to interactively select a distribution and view:

    • Estimated peak flows

    • Distribution parameters

    • GOF statistics

    • A flood frequency curve plotted in log scale


How to Use#

  1. Prepare Input File

    • Download annual peak streamflow data from the USGS NWIS Peak Flow site

    • Save as a tab-delimited .txt file (e.g., 07022500_nwis_peak.txt)

  2. Run the Script in Jupyter Notebook

    • Place the file in your working directory

    • Modify the line usgs_file = "07022500_nwis_peak.txt" to match your filename

    • Run the script cell-by-cell

  3. Explore Results

    • View the summary table of fitted distribution parameters and their statistical performance

    • Use the dropdown selector to compare estimated flood flows and curves for each distribution


Theoretical Background: Distributions Used#

Each distribution estimates the probability of rare flood events based on historical data. Here’s a quick reference:

| Distribution | Description | Parameters |
|---|---|---|
| Gumbel (EV1) | Models block maxima (e.g., annual max). Skewed right. | Location (μ), Scale (β) |
| Log-Pearson III | Log-transformed Pearson Type III. Used in U.S. federal flood studies. | Shape (α), Location (μ), Scale |
| GEV | General form for extremes. Includes Gumbel, Fréchet, Weibull as cases. | Shape (ξ), Location, Scale |
| Normal | Symmetric bell curve. May misrepresent skewed flood data. | Mean (μ), Std. dev. (σ) |
| Lognormal | Data are normally distributed after log transform. Skewed right. | Shape (σ), Location, Scale |
| Weibull (Type III) | Useful for extreme minima or upper tails. | Shape (k), Location, Scale |
| Exponential | Special case of Weibull; constant failure rate (rarely used for floods). | Rate (λ) or Scale |
| Gamma | General skewed distribution; flexible fit for hydrology. | Shape (k), Scale (θ), Location |
| Loglogistic (Fisk) | Skewed right, like lognormal but with a heavier tail. | Shape (c), Location, Scale |
| Generalized Pareto | Models excesses over a threshold (POT approach). | Shape, Location, Scale |


Performance Evaluation Criteria#

Two statistical metrics assess how well each distribution fits the observed data; a short sketch computing both follows the list:

  • 🔹 Root Mean Squared Error (RMSE)

    Measures the average error between observed peak flows and quantiles estimated from the fitted distribution: $\text{RMSE} = \sqrt{ \frac{1}{n} \sum (Q_{\text{obs}} - Q_{\text{est}})^2 }$. Lower values indicate a better fit.

  • 🔹 Kolmogorov–Smirnov (KS) Statistic

    Measures the maximum difference between the empirical cumulative distribution function (ECDF) and the theoretical CDF: $D = \sup_x |F_n(x) - F(x)|$

    • Returns both the KS statistic and a p-value

    • If the p-value > 0.05, the distribution cannot be rejected at the 5% significance level (✅ Pass)
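
A minimal sketch computing both metrics on a synthetic Gumbel sample, mirroring how the tool compares sorted observations to fitted quantiles (sample size and parameters are arbitrary):

import numpy as np
from scipy import stats

# Synthetic Gumbel sample standing in for annual peaks (parameters arbitrary)
rng = np.random.default_rng(42)
sample = stats.gumbel_r.rvs(loc=700, scale=250, size=50, random_state=rng)

# Fit, then compare sorted observations against fitted quantiles
params = stats.gumbel_r.fit(sample)
sorted_obs = np.sort(sample)
probs = np.linspace(0.01, 0.99, len(sorted_obs))
q_est = stats.gumbel_r.ppf(probs, *params)

rmse = np.sqrt(np.mean((sorted_obs - q_est) ** 2))
ks_stat, ks_pval = stats.kstest(sample, stats.gumbel_r.cdf, args=params)
print(f"RMSE = {rmse:.1f} cfs, KS = {ks_stat:.3f}, p = {ks_pval:.3f}")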


Output Summary#

  • A sorted summary table of all distributions including:

    • Fitted parameters

    • RMSE

    • KS statistic and p-value

    • Pass/fail interpretation

  • Interactive flood frequency plots for return periods on a log-x axis

  • Ability to choose which distribution best represents the dataset


Applications#

  • Floodplain mapping

  • Hydraulic structure design (culverts, bridges, dams)

  • Return period–based risk estimation

  • Hydrologic modeling calibration



2. Simulation#

import pandas as pd
from io import StringIO

# Simulate the file content as a string
data = """\
agency_cd   site_no peak_dt peak_va
USGS    7022500 3/3/1953    780
USGS    7022500 1/20/1954   520
USGS    7022500 3/20/1955   846
USGS    7022500 2/2/1956    440
USGS    7022500 1/22/1957   707
USGS    7022500 11/18/1957  763
USGS    7022500 8/6/1959    514
USGS    7022500 7/3/1960    602
USGS    7022500 3/12/1961   304
USGS    7022500 9/16/1962   833
USGS    7022500 3/4/1963    228
USGS    7022500 3/4/1964    690
USGS    7022500 3/29/1965   1870
USGS    7022500 3/20/1968   545
USGS    7022500 4/9/1969    702
USGS    7022500 6/13/1970   295
USGS    7022500 8/21/1971   987
USGS    7022500 7/28/1972   350
USGS    7022500 5/1/1973    816
USGS    7022500 11/24/1973  900
USGS    7022500 3/29/1975   1160
USGS    7022500 2/17/1976   800
USGS    7022500 6/26/1977   748
USGS    7022500 3/14/1978   728
USGS    7022500 12/3/1978   1460
USGS    7022500 3/17/1980   364
USGS    7022500 6/6/1981    975
USGS    7022500 1/4/1982    1080
USGS    7022500 6/3/1983    2760
USGS    7022500 4/29/1984   954
USGS    7022500 9/5/1985    660
USGS    7022500 5/24/1986   360
USGS    7022500 2/28/1987   415
"""

# Read it into a DataFrame; the simulated string is whitespace-delimited,
# so use a regex separator (sep="\t" would collapse everything into one
# column and later raise KeyError: 'peak_va')
df = pd.read_csv(StringIO(data), sep=r"\s+")
df.columns = df.columns.str.strip()

# Display the first few rows
display(df.head())

print(df['peak_va'])
agency_cd site_no peak_dt peak_va
0 USGS 7022500 3/3/1953 780
1 USGS 7022500 1/20/1954 520
2 USGS 7022500 3/20/1955 846
3 USGS 7022500 2/2/1956 440
4 USGS 7022500 1/22/1957 707
### Required Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import kstest
from sklearn.metrics import mean_squared_error
from ipywidgets import interact, Dropdown
from IPython.display import display
#pip install scipy
### Load NWIS Peak Flow Data
# def read_nwis_peak_file(file_path):
#     try:
#         with open(file_path, 'r') as f:
#             lines = f.readlines()
#         start_line = next(i for i, line in enumerate(lines) if not line.startswith('#'))
#         df = pd.read_csv(
#             file_path,
#             sep='\t',
#             comment='#',
#             header=0,
#             dtype=str,
#             engine='python'
#         )
#         df.columns = df.columns.str.strip()
#         df['peak_dt'] = pd.to_datetime(df['peak_dt'], errors='coerce')
#         df['peak_va'] = pd.to_numeric(df['peak_va'], errors='coerce')
#         df_clean = df[['site_no', 'peak_dt', 'peak_va']].dropna()
#         return df_clean
#     except Exception as e:
#         print(f"❌ Error reading file: {e}")
#         return pd.DataFrame()

### Load & Preview Data
# usgs_file = "07022500_nwis_peak.txt"
# peak_df = read_nwis_peak_file(usgs_file)
# display(peak_df.head())

### Set Up Distribution Parameters
distributions = {
    "Gumbel (EV1)": stats.gumbel_r,
    "Log-Pearson III": stats.pearson3,
    "GEV": stats.genextreme,
    "Normal": stats.norm,
    "Lognormal": stats.lognorm,
    "Weibull": stats.weibull_min,
    "Exponential": stats.expon,
    "Gamma": stats.gamma,
    "Loglogistic": stats.fisk,
    "Generalized Pareto": stats.genpareto
}

###  Define Probability Array
# Use the simulated record; switch to peak_df['peak_va'] when loading a real NWIS file
peak_values = df['peak_va'].dropna()
sorted_data = np.sort(peak_values)
prob_plot = np.linspace(0.01, 0.99, 100)
return_periods = 1 / (1 - prob_plot)
prob_exceed = 0.01  # For 100-year flood estimate

###  Fit Distributions
summary_rows = []
fit_results = {}

for name, dist in distributions.items():
    try:
        params = dist.fit(peak_values)
        flood_q = dist.ppf(1 - prob_exceed, *params)
        # Evaluate fitted quantiles at as many probability points as there
        # are observations, so they compare one-to-one with the sorted data
        # (use a separate name so the module-level prob_plot stays intact)
        prob_fit = np.linspace(0.01, 0.99, len(sorted_data))
        q_estimates = dist.ppf(prob_fit, *params)

        
        rmse = np.sqrt(mean_squared_error(sorted_data, q_estimates))
        ks_stat, ks_pval = kstest(peak_values, dist.cdf, args=params)

        fit_results[name] = {
            "params": params,
            "q": dist.ppf(1 - 1 / return_periods, *params),
            "rmse": rmse,
            "ks_stat": ks_stat,
            "ks_pval": ks_pval
        }

        param_str = ", ".join([f"{p:.2f}" for p in params])
        summary_rows.append({
            "Distribution": name,
            "Parameters": param_str,
            "RMSE (cfs)": round(rmse, 2),
            "KS Stat": round(ks_stat, 3),
            "KS p-value": round(ks_pval, 3),
            "KS Result": "✅ Pass" if ks_pval > 0.05 else "❌ Reject"
        })

    except Exception as e:
        print(f"⚠️ Could not fit {name}: {e}")

### Summary Table
summary_df = pd.DataFrame(summary_rows).sort_values(by="RMSE (cfs)")
print("\n📊 Goodness-of-Fit Summary for All Distributions:\n")
display(summary_df)

### Interactive Plotting
def plot_selected_distribution(dist_name):
    result = fit_results[dist_name]
    q = result["q"]
    params = result["params"]
    param_str = ", ".join([f"{p:.2f}" for p in params])

    plt.figure(figsize=(8, 5))
    plt.plot(return_periods, q, marker='o', linestyle='-', color='royalblue', label="Estimated Peak Flow")
    plt.xscale('log')
    plt.xlabel("Return Period (years, log scale)")
    plt.ylabel("Estimated Peak Flow (cfs)")
    plt.title(f"{dist_name} Flood Frequency Curve\nParameters: {param_str}")
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.legend()
    plt.tight_layout()
    plt.show()

    # Tabular Output
    df_plot = pd.DataFrame({
        "Return Period (yr)": return_periods.round(1),
        "Estimated Peak Flow (cfs)": q.round(2)
    })
    print(f"\n📌 Parameters: {param_str}")
    print(f"RMSE: {result['rmse']:.2f}, KS stat: {result['ks_stat']:.3f}, p-value: {result['ks_pval']:.3f}")
    display(df_plot)

### Launch Widget
interact(plot_selected_distribution, dist_name=Dropdown(options=list(fit_results.keys()), description="Distribution"))

3. Self-Assessment#

Self-Assessment: Flood Frequency Analysis Tool#

Use these prompts and questions to evaluate your understanding of the tool and its underlying hydrologic and statistical concepts.


Conceptual Questions#

  1. Why are return periods plotted on a logarithmic scale in flood frequency analysis?

    • Hint: Think about how frequent vs. rare events are distributed.

  2. What is the purpose of fitting multiple distributions to the same peak flow dataset?

    • Hint: No single distribution fits all scenarios equally well.

  3. How do Gringorten plotting positions help in flood frequency analysis?

    • Hint: They’re used to assign empirical probabilities to ordered data.

  4. What assumptions underlie the use of the Gumbel distribution in hydrology?

    • Hint: It’s designed to model block maxima like annual peak flows.

  5. How do parametric and non-parametric flood frequency methods differ in their approach?

    • Hint: Consider how the data distribution is treated.


Reflective Prompts#

  1. If two distributions yield similar RMSE but different KS p-values, which metric is more important for selecting a model—and why?

  2. Can a statistically good-fitting distribution be inappropriate for design applications? Provide an example.

  3. How would you adapt this tool to process data from multiple gage stations simultaneously?

  4. What limitations might this tool face when applied to future climate-affected streamflow patterns?

  5. How would the analysis change if you used partial-duration series instead of annual maxima?


Quiz Questions#

Q1. The Gumbel distribution is commonly used to model:
A. Rainfall intensity
B. Annual maximum values
C. Median flow durations
D. Baseflow during drought
Correct: B


Q2. The Kolmogorov–Smirnov test compares:
A. Log and normal distributions
B. ECDF and theoretical CDF
C. Mean annual rainfall
D. Number of peaks above threshold
Correct: B


Q3. In the Generalized Extreme Value distribution, the shape parameter controls:
A. Peak discharge
B. Tail behavior
C. Cumulative runoff
D. Frequency of low flows
Correct: B


Q4. A high KS p-value and low RMSE suggest:
A. Overfitting
B. Good model fit
C. Poor data resolution
D. Statistical bias
Correct: B


Q5. Which distribution is least appropriate for positively skewed hydrologic data?
A. Gumbel
B. Lognormal
C. Normal
D. Log-Pearson III
Correct: C