Database description and data preparation
1 Description of the database
The database for this project is structured to support hydrological modeling under both historical conditions and future climate change scenarios. It is organized into two main components: observed in-situ data and simulated climate model data.
1.1 Observed Data (In-Situ)
The file dataobs_P_TEMP_ETP_DEBIT.csv contains the historical observations used for model calibration and validation. It includes the following variables:
- P: Precipitation (mm)
- TEMP: Air Temperature (°C)
- ETP: Potential Evapotranspiration (mm)
- DEBIT: River discharge or spring flow (m^3/s)
1.2 Climate model data (projections)
The database includes climate projections from two different RCMs, organized into specific folders:
- CNRM: National Centre for Meteorological Research (France)
- SMHI_MPI: Swedish Meteorological and Hydrological Institute – Max Planck Institute Earth System Model
Each model folder contains several CSV files categorized by variable and scenario:
- PR: Precipitation
- TMAX: Maximum Temperature
- TMIN: Minimum Temperature
Scenarios:
- hist (Historical): Reference simulations for the historical period.
- rcp45 (RCP 4.5): A moderate climate change scenario in which greenhouse gas emissions are gradually reduced, leading to a stabilization of global warming around a radiative forcing of 4.5 W/m² after 2100.
- rcp85 (RCP 8.5): A high-emission scenario assuming no significant mitigation efforts, resulting in a strong increase in greenhouse gas emissions and a radiative forcing exceeding 8.5 W/m² by 2100.
The climate files follow a standardized naming convention: [model]_data[Variable]_[Scenario].csv.
Example: cnrm_dataTMAX_rcp85.csv contains the Maximum Temperature projections for the Lez basin from the CNRM model under the pessimistic RCP 8.5 scenario.
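To illustrate, the convention can be reproduced programmatically; climate_filename below is a hypothetical helper for illustration only, not part of the repository:
# build a climate file name from its three components (illustrative only)
climate_filename = function(model, variable, scenario) {
  sprintf("%s_data%s_%s.csv", model, variable, scenario)
}
climate_filename("cnrm", "TMAX", "rcp85")  # "cnrm_dataTMAX_rcp85.csv"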
2 Historical (observational) data preparation and time windowing
In this section, we subset the observed data to the historical reference period (1977–2005). Data manipulation is performed with the dplyr package (part of the tidyverse), and date handling is managed with the lubridate package.
Load necessary libraries:
library(dplyr)
library(lubridate)
Load Data:
# Set Github repo. url to access data
github_repo = paste0("https://raw.githubusercontent.com/",
"diopBachir/tp_modhydroclimat/main/")
# build the URL of the observations CSV
obs_url = file.path(github_repo, "data/dataobs_P_TEMP_ETP_DEBIT.csv")
# set column names [Date, P, TEMP, ETP, DEBIT -> Date, P, T, PET, Q] and classes
colnames = c("Date", "P", "T", "PET", "Q")
colclasses = c("character", "numeric", "numeric", "numeric", "numeric")
# import data
obsdata = read.csv(obs_url, col.names = colnames, colClasses = colclasses)
# Convert the Date column to R-Date objects
obsdata[["Date"]] = as.Date(obsdata[["Date"]], format = "%Y%m%d")
# print data
head(obsdata)
        Date P      T PET     Q
1 1975-01-02 0 3.2090 1.9 0.274
2 1975-01-03 0 3.3290 1.3 0.252
3 1975-01-04 0 3.4040 1.3 0.243
4 1975-01-05 0 3.6430 1.1 0.243
5 1975-01-06 0 3.5825 0.9 0.250
6 1975-01-07 0 3.7240 1.1 0.245
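Before subsetting, a quick sanity check of the temporal coverage and completeness of the series can be useful (a minimal sketch; output not shown):
# check the period covered and count missing values per column
range(obsdata$Date)
colSums(is.na(obsdata))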
Define Time Windows:
# Historical period: 1977-01-01 to 2005-12-31
hist_start = as.Date("1977-01-01")
hist_end = as.Date("2005-12-31")
# Subset historical data
obs_hist = obsdata %>%
  filter(Date >= hist_start, Date <= hist_end)
We now convert streamflow from m^3/s to mm. Converting streamflow from a volume rate (m^3/s) to a depth (mm) is required by almost all hydrological models, and allows a direct comparison with precipitation and evapotranspiration. To do this, we normalize the flow by the catchment area.
To convert discharge Q (in m^3/s) to a runoff depth R (in mm/day), we use the following formula: R_{mm/day} = \frac{Q_{m^3/s} \times 86400 \times 1000}{Area_{m^2}}
Where:
- 86,400: Number of seconds in a day.
- 1,000: Conversion factor from meters to millimeters.
- Area: The catchment area in square meters.
If the catchment area is expressed in km^2 (which is standard), the formula simplifies to:
R_{mm/day} = \frac{Q_{m^3/s} \times 86.4}{Area_{km^2}}
The constant 86.4 is derived from \frac{86,400 \text{ sec/day} \times 1,000 \text{ mm/m}}{1,000,000 \text{ m}^2\text{/km}^2}.
To convert the model output (mm/day) back to volumetric flow (m^3/s):
Q_{m^3/s} = \frac{R_{mm/day} \times Area_{km^2}}{86.4}
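For instance, with the catchment area of 178 km^2 used below, a discharge of 1 m^3/s corresponds to 86.4 / 178 ≈ 0.485 mm/day. Both conversions can be wrapped in small helper functions (a minimal sketch; q_to_mm and mm_to_q are hypothetical names):
# discharge (m^3/s) -> runoff depth (mm/day); area in km^2
q_to_mm = function(q_m3s, area_km2) q_m3s * 86.4 / area_km2
# runoff depth (mm/day) -> discharge (m^3/s)
mm_to_q = function(r_mmday, area_km2) r_mmday * area_km2 / 86.4
# round-trip check: returns the original 1 m^3/s
mm_to_q(q_to_mm(1, 178), 178)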
# catchment area in km^2
A = 178
# convert discharge (m^3/s) to runoff depth (mm/day)
obs_hist[["Q"]] = (obs_hist[["Q"]] * 86.4) / A
3 Climate model data preparation and time windowing
We first define the temporal window used for climate projections.
# Define projection period (project 1977-2005 window to 2070)
duration = hist_end - hist_start
fut_start = ymd("2070-01-01")
fut_end = fut_start + duration
To automate the processing of our climate model folders (CNRM, SMHI_MPI), we have created a helper function named prepare_climate_data(...) that handles file paths, date filtering, and variable merging. Since our data is split into separate files for PR, TMAX, and TMIN, the function joins these variables on the date to return a clean, usable dataset (see the sketch below).
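The following is only an illustrative sketch of the logic such a function might follow; the column names, date format, and returned list structure are assumptions, and the real implementation is sourced from the repository afterwards:
# Illustrative sketch only -- the real function is sourced below.
# Assumes each CSV has a Date column plus one value column.
prepare_climate_data_sketch = function(model, path,
                                       hist_start, hist_end,
                                       fut_start, fut_end) {
  read_var = function(variable, scenario) {
    f = file.path(path, sprintf("%s_data%s_%s.csv", model, variable, scenario))
    d = read.csv(f)
    d$Date = as.Date(d$Date)                 # assumed date format
    names(d)[names(d) != "Date"] = variable  # label the value column
    d
  }
  # join PR, TMAX and TMIN on the date for one scenario, then window it
  merge_scenario = function(scenario, start, end) {
    vars = lapply(c("PR", "TMAX", "TMIN"), read_var, scenario = scenario)
    d = Reduce(function(x, y) dplyr::inner_join(x, y, by = "Date"), vars)
    dplyr::filter(d, Date >= start, Date <= end)
  }
  list(hist  = merge_scenario("hist",  hist_start, hist_end),
       rcp45 = merge_scenario("rcp45", fut_start, fut_end),
       rcp85 = merge_scenario("rcp85", fut_start, fut_end))
}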
The chunk below sources the actual function from the Github repository:
source(file.path(github_repo, "helper_functions/prepare_climate_data.R"))
We can now call this function for each of our climate models. Note that all the variables used below to call the prepare_climate_data function have been defined in the previous chunks, namely github_repo, hist_start, hist_end, fut_start, and fut_end.
# define the data path
climdatapath = file.path(github_repo, "data")
# Process all models
cnrm_data = prepare_climate_data(
"cnrm", climdatapath, hist_start, hist_end, fut_start, fut_end
)
smhi_mpi_data = prepare_climate_data(
  "smhi_mpi", climdatapath, hist_start, hist_end, fut_start, fut_end
)
4 Exporting preprocessed data
Effective data management requires saving intermediate datasets to establish a reliable baseline for subsequent modeling steps, in this case bias correction and hydrological simulations. By exporting the filtered historical observations and harmonized climate projections, we:
- Avoid repeating computationally intensive preprocessing routines.
- Minimize the risk of temporal inconsistencies or misalignments.
- Maintain a transparent audit trail of all transformations applied to the data.
This approach not only improves computational efficiency in future R sessions but also ensures that clean, analysis-ready datasets can be readily shared across software platforms.
The following code creates an output_preprocessed/ directory and saves the observed historical data alongside the processed climate model data. To ensure a seamless transition between data preparation and the subsequent modeling phases, we use the RDS (R Data Serialization) format to archive our preprocessed datasets. Exporting the filtered historical observations and the nested lists of climate projections into .rds files is more robust than using flat CSVs, as it preserves R-specific metadata, such as complex list structures and Date-class attributes, while offering significant file compression.
In the next part of the practical session, the previously exported datasets can be easily reloaded in a single step using the readRDS() function. This approach preserves the original data structure, including column types, ensuring a consistent and analysis-ready dataset without the need to repeat preprocessing steps.
# Create an output directory if it doesn't exist
if (!dir.exists("output_preprocessed")) dir.create("output_preprocessed")
# This keeps all Climate Models in a single file
models_list = list(
  climdataCNRM = cnrm_data,
  climdataSMHI_MPI = smhi_mpi_data
)
# Export using saveRDS
saveRDS(obs_hist, file="output_preprocessed/processed_obsdata_lez.rds")
saveRDS(models_list, file="output_preprocessed/processed_climatedata_lez.rds")
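In a later session, the exported files can then be reloaded with readRDS() (a minimal sketch):
# reload the preprocessed datasets with their structure and types intact
obs_hist = readRDS("output_preprocessed/processed_obsdata_lez.rds")
models_list = readRDS("output_preprocessed/processed_climatedata_lez.rds")
# inspect the top level of the climate models list
str(models_list, max.level = 1)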