Author: hsbadr

Description: Hierarchical Climate Regionalization
Language: R
Repository: git://github.com/hsbadr/HiClimR.git
Created: 2015-02-22T23:45:37Z
Project page: https://github.com/hsbadr/HiClimR

License: GNU General Public License v3.0


HiClimR

Lifecycle: Stable

License: GPL v3
DOI: 10.1007/s12145-015-0221-7

HiClimR: Hierarchical Climate Regionalization

Introduction

HiClimR is a tool for Hierarchical Climate Regionalization applicable to any correlation-based clustering. Climate regionalization is the process of dividing an area into smaller regions that are homogeneous with respect to a specified climatic metric. Several features are added to facilitate the applications of climate regionalization (or spatiotemporal analysis in general) and to implement a cluster validation function with an objective tree cutting to find an optimal number of clusters for a user-specified confidence level. These include options for preprocessing and postprocessing as well as efficient code execution for large datasets and options for splitting big data and computing only the upper-triangular half of the correlation/dissimilarity matrix to overcome memory limitations. Hybrid hierarchical clustering reconstructs the upper part of the tree above a cut to get the best of the available methods. Multivariate clustering (MVC) provides options for filtering all variables before preprocessing, detrending and standardization of each variable, and applying weights for the preprocessed variables.

Features

HiClimR adds several features and a new clustering method (called regional linkage) to hierarchical clustering in R (the hclust function in the stats package), including:

  • data regridding
  • coarsening spatial resolution
  • geographic masking
    • by continents
    • by regions
    • by countries
  • contiguity-constrained clustering
  • data filtering by thresholds
    • mean threshold
    • variance threshold
  • data preprocessing
    • detrending
    • standardization
    • PCA
  • faster correlation function
    • splitting big data matrix
    • computing upper-triangular matrix
    • using optimized BLAS library on 64-Bit machines
      • ATLAS
      • OpenBLAS
      • Intel MKL
  • different clustering methods
    • regional linkage or minimum inter-regional correlation
    • Ward's minimum variance or error sum of squares method
    • single linkage or nearest neighbor method
    • complete linkage or diameter
    • average linkage, group average, or UPGMA method
    • McQuitty's or WPGMA method
    • median, Gower's or WPGMC method
    • centroid or UPGMC method
  • hybrid hierarchical clustering
    • the upper part of the tree is reconstructed above a cut
    • the lower part of the tree uses user-selected method
    • the upper part of the tree uses regional linkage method
  • multivariate clustering (MVC)
    • filtering all variables before preprocessing
    • detrending and standardization of each variable
    • applying weights for the preprocessed variables
  • cluster validation
    • summary statistics based on raw data or the data reconstructed by PCA
    • objective tree cut using minimum significant correlation between region means
  • visualization of regionalization results
  • exporting region map and mean timeseries into NetCDF-4
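
The two correlation shortcuts listed above (splitting the data matrix and computing only the upper-triangular half) can be illustrated in base R. This is a minimal sketch and not the package's fastCor implementation: blockCor is a hypothetical helper, and it still allocates the full matrix, so it shows the order of computation rather than the memory savings.

```r
# Hypothetical helper (not part of HiClimR): correlation between the columns
# of x, computed block by block and only for blocks on or above the diagonal.
blockCor <- function(x, nSplit = 2) {
  # Standardize the columns so that correlation reduces to a cross-product
  z <- scale(x) / sqrt(nrow(x) - 1)
  p <- ncol(z)
  # Assign the columns to (at most) nSplit contiguous blocks
  blocks <- split(seq_len(p), ceiling(seq_len(p) * nSplit / p))
  r <- matrix(NA_real_, p, p)
  for (i in seq_along(blocks)) {
    for (j in i:length(blocks)) {
      r[blocks[[i]], blocks[[j]]] <- crossprod(
        z[, blocks[[i]], drop = FALSE],
        z[, blocks[[j]], drop = FALSE]
      )
    }
  }
  r[lower.tri(r)] <- NA  # keep the upper-triangular half only
  r
}

set.seed(1)
x <- matrix(rnorm(200), nrow = 20, ncol = 10)
r <- blockCor(x, nSplit = 3)

# The upper triangle should agree with stats::cor()
all.equal(r[upper.tri(r)], cor(x)[upper.tri(r)])
```

Each block pair needs only two column subsets in memory at a time, which is why splitting can help when the full standardized matrix is too large to multiply in one step.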

The regional linkage method is explained in the context of a spatiotemporal problem, in which N spatial elements (e.g., weather stations) are divided into k regions, given that each element has a time series of length M. It is based on the inter-regional correlation distance between the temporal means of different regions (or between elements at the first merging step). It modifies the update formulae of the average linkage method by incorporating the standard deviation of the merged region time series, which is a function of the correlation between the individual regions and of their standard deviations before merging. This standard deviation equals the average of the two regions' standard deviations if and only if the correlation between the two merged regions is 100%. In this special case, the regional linkage method reduces to the classic average linkage clustering method.
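
The statement about the standard deviation of a merged region can be checked numerically. The sketch below uses two synthetic series and assumes that the two regions are weighted equally when merged, which is a simplification made here and not taken from the package code.

```r
set.seed(42)
M <- 50
a <- rnorm(M)
b <- 0.6 * a + rnorm(M)  # imperfectly correlated with a

sa <- sd(a)
sb <- sd(b)
r <- cor(a, b)

# Standard deviation of the equally weighted mean of the two series
sdMerged <- sd((a + b) / 2)
# Same quantity from the individual standard deviations and the correlation
sdFormula <- 0.5 * sqrt(sa^2 + sb^2 + 2 * r * sa * sb)

all.equal(sdMerged, sdFormula)  # the identity holds for any pair of series
sdMerged < (sa + sb) / 2        # below the average whenever r < 1

# Perfectly correlated case: b2 is a positive multiple of a, so r = 1
b2 <- 2 * a
all.equal(sd((a + b2) / 2), (sd(a) + sd(b2)) / 2)
```

When r < 1 the merged series is less variable than the average of its parts, which is the quantity that distinguishes regional linkage from average linkage in the description above.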

Implementation

Badr et al. (2015) describe the regionalization algorithms, features, and data processing tools included in the package and present a demonstration application in which the package is used to regionalize Africa on the basis of interannual precipitation variability. The figure below shows a detailed flowchart for the package. Cyan blocks represent helper functions, green blocks are input data or parameters, yellow blocks indicate the agglomeration Fortran code, and purple blocks show graphics options. For multivariate clustering (MVC), the input data is a list of matrices (one matrix for each variable, all with the same number of rows to be clustered; the number of columns may vary per variable). The blue dashed boxes involve a loop over all variables to apply mean and/or variance thresholds, detrending, and/or standardization per variable before weighting the preprocessed variables and binding them by columns into one matrix for clustering.

In the flowchart, x is the input N x M data matrix; xc is the coarsened N0 x M data matrix, where N0 ≤ N (N0 = N only if lonStep = 1 and latStep = 1); xm is the masked and filtered N1 x M1 data matrix, where N1 ≤ N0 (N1 = N0 only if the number of masked stations/points is zero) and M1 ≤ M (M1 = M only if no columns are removed due to missing values); and x1 is the reconstructed N1 x M1 data matrix if PCA is performed.
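
The per-variable loop for multivariate clustering can be sketched in base R as follows. This is an illustration only: preprocessVar is a hypothetical helper, not a HiClimR function, and applying the weights by simple multiplication is an assumption made for the example.

```r
# Hypothetical helper (not a HiClimR function): remove a linear trend from
# each row and/or standardize each row of an N x M matrix.
preprocessVar <- function(x, detrend = TRUE, standardize = TRUE) {
  if (detrend) {
    tt <- seq_len(ncol(x))
    # Residuals of a least-squares line fitted to every row at once
    x <- t(residuals(lm(t(x) ~ tt)))
  }
  if (standardize) {
    x <- t(scale(t(x)))
  }
  x
}

set.seed(7)
N <- 6
# Two variables with the same number of rows but different numbers of columns
x1 <- matrix(rnorm(N * 30), N, 30) + outer(rep(1, N), 0.1 * (1:30))  # trended
x2 <- matrix(rnorm(N * 12), N, 12)

vars    <- list(x1, x2)
detrend <- list(TRUE, FALSE)
weights <- list(1, 2)

# Preprocess each variable separately, weight it, then bind by columns
processed <- Map(
  function(x, d, w) w * preprocessVar(x, detrend = d),
  vars, detrend, weights
)
xAll <- do.call(cbind, processed)
dim(xAll)  # 6 rows, 30 + 12 = 42 columns
```

The combined matrix keeps one row per spatial element, so it can be clustered exactly like a single-variable N x M matrix.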

HiClimR Flowchart

HiClimR is applicable to any correlation-based clustering.

Installation

There are many ways to install an R package from precompiled binaries or source code. For more details, you may search for how to install an R package, but here are the most convenient ways to install HiClimR:

From CRAN

This is the easiest way to install an R package on Windows, Mac, or Linux. You just fire up an R shell and type:

    install.packages("HiClimR")

The package should install without further steps, although you may be asked to select a local CRAN mirror (i.e., the server from which the package is downloaded). If you are using RGui or RStudio, you can use the package installation menu to search for HiClimR and install it.

From GitHub

This is intended for developers and requires a development environment (compilers, libraries, etc.) to install the latest development release of HiClimR. On Linux and Mac, you can download the source code and use R CMD INSTALL to install it. More conveniently, you can use pak as follows:

  • Install pak from CRAN:

        install.packages("pak")

  • Make sure you have a working development environment:
    • Windows: Install Rtools.
    • Mac: Install Xcode from the Mac App Store.
    • Linux: Install a compiler and various development libraries (details vary across different flavors of Linux).
  • Install HiClimR from GitHub source:

        pak::pkg_install("hsbadr/HiClimR")

Source

The source code repository can be found on GitHub at hsbadr/HiClimR.

License

HiClimR is licensed under GPL v3. The code is modified by Hamada S. Badr from src/library/stats/R/hclust.R part of R package Copyright © 1995-2021 The R Core Team.

  • This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

  • This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

A copy of the GNU General Public License is available at https://www.r-project.org/Licenses.

Copyright © 2013-2021 Earth and Planetary Sciences (EPS), Johns Hopkins University (JHU).

Citation

To cite HiClimR in publications, please use:

    citation("HiClimR")

Badr, H. S., Zaitchik, B. F., and Dezfuli, A. K. (2015):
A Tool for Hierarchical Climate Regionalization, Earth Science Informatics,
8(4), 949-958, https://doi.org/10.1007/s12145-015-0221-7.

Badr, H. S., Zaitchik, B. F., and Dezfuli, A. K. (2014):
HiClimR: Hierarchical Climate Regionalization, Comprehensive R Archive Network (CRAN),
https://cran.r-project.org/package=HiClimR.

History

Version  Date      Comment   Author          Email
         May 1992  Original  F. Murtagh
         Dec 1996  Modified  Ross Ihaka
         Apr 1998  Modified  F. Leisch
         Jun 2000  Modified  F. Leisch
1.0.0    03/07/14  HiClimR   Hamada S. Badr  badr@jhu.edu
1.0.1    03/08/14  Updated   Hamada S. Badr  badr@jhu.edu
1.0.2    03/09/14  Updated   Hamada S. Badr  badr@jhu.edu
1.0.3    03/12/14  Updated   Hamada S. Badr  badr@jhu.edu
1.0.4    03/14/14  Updated   Hamada S. Badr  badr@jhu.edu
1.0.5    03/18/14  Updated   Hamada S. Badr  badr@jhu.edu
1.0.6    03/25/14  Updated   Hamada S. Badr  badr@jhu.edu
1.0.7    03/30/14  Hybrid    Hamada S. Badr  badr@jhu.edu
1.0.8    05/06/14  Updated   Hamada S. Badr  badr@jhu.edu
1.0.9    05/07/14  CRAN      Hamada S. Badr  badr@jhu.edu
1.1.0    05/15/14  Updated   Hamada S. Badr  badr@jhu.edu
1.1.1    07/14/14  Updated   Hamada S. Badr  badr@jhu.edu
1.1.2    07/26/14  Updated   Hamada S. Badr  badr@jhu.edu
1.1.3    08/28/14  Updated   Hamada S. Badr  badr@jhu.edu
1.1.4    09/01/14  Updated   Hamada S. Badr  badr@jhu.edu
1.1.5    11/12/14  Updated   Hamada S. Badr  badr@jhu.edu
1.1.6    03/01/15  GitHub    Hamada S. Badr  badr@jhu.edu
1.2.0    03/27/15  MVC       Hamada S. Badr  badr@jhu.edu
1.2.1    05/24/15  Updated   Hamada S. Badr  badr@jhu.edu
1.2.2    07/21/15  Updated   Hamada S. Badr  badr@jhu.edu
1.2.3    08/05/15  Updated   Hamada S. Badr  badr@jhu.edu
2.0.0    12/22/18  NOTE      Hamada S. Badr  badr@jhu.edu
2.1.0    01/01/19  NetCDF    Hamada S. Badr  badr@jhu.edu
2.1.1    01/02/19  Updated   Hamada S. Badr  badr@jhu.edu
2.1.2    01/04/19  Updated   Hamada S. Badr  badr@jhu.edu
2.1.3    01/10/19  Updated   Hamada S. Badr  badr@jhu.edu
2.1.4    01/20/19  Updated   Hamada S. Badr  badr@jhu.edu
2.1.5    12/10/19  inherits  Hamada S. Badr  badr@jhu.edu
2.1.6    02/22/20  Updated   Hamada S. Badr  badr@jhu.edu
2.1.7    11/05/20  Updated   Hamada S. Badr  badr@jhu.edu
2.1.8    01/04/21  Updated   Hamada S. Badr  badr@jhu.edu

Examples

Single-Variate Clustering

    library(HiClimR)

    #----------------------------------------------------------------------------------#
    # Typical use of HiClimR for single-variate clustering:                             #
    #----------------------------------------------------------------------------------#

    ## Load the test data included/loaded in the package (1 degree resolution)
    x <- TestCase$x
    lon <- TestCase$lon
    lat <- TestCase$lat

    ## Generate/check longitude and latitude mesh vectors for gridded data
    xGrid <- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
    lon <- c(xGrid$lon)
    lat <- c(xGrid$lat)

    ## Single-Variate Hierarchical Climate Regionalization
    y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
        continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
        standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
        members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
        validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
        plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)

    #----------------------------------------------------------------------------------#
    # Additional Examples:                                                              #
    #----------------------------------------------------------------------------------#

    ## Use Ward's method
    y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
        continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
        standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
        members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
        validClimR = TRUE, k = 5, minSize = 1, alpha = 0.01,
        plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)

    ## Use data splitting for big data
    y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
        continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
        standardize = TRUE, nPC = NULL, method = "ward", hybrid = TRUE, kH = NULL,
        members = NULL, nSplit = 10, upperTri = TRUE, verbose = TRUE,
        validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
        plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)

    ## Use hybrid Ward-Regional method
    y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
        continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
        standardize = TRUE, nPC = NULL, method = "ward", hybrid = TRUE, kH = NULL,
        members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
        validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
        plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)

    ## Check sensitivity to kH for the hybrid method above

Multivariate Clustering

    library(HiClimR)

    #----------------------------------------------------------------------------------#
    # Typical use of HiClimR for multivariate clustering:                               #
    #----------------------------------------------------------------------------------#

    ## Load the test data included/loaded in the package (1 degree resolution)
    x1 <- TestCase$x
    lon <- TestCase$lon
    lat <- TestCase$lat

    ## Generate/check longitude and latitude mesh vectors for gridded data
    xGrid <- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
    lon <- c(xGrid$lon)
    lat <- c(xGrid$lat)

    ## Test if we can replicate single-variate region map with repeated variable
    y <- HiClimR(x = list(x1, x1), lon = lon, lat = lat, lonStep = 1, latStep = 1,
        geogMask = FALSE, continent = "Africa", meanThresh = list(10, 10),
        varThresh = list(0, 0), detrend = list(TRUE, TRUE), standardize = list(TRUE, TRUE),
        nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
        members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
        validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
        plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)

    ## Generate a random matrix with the same number of rows
    x2 <- matrix(rnorm(nrow(x1) * 100, mean = 0, sd = 1), nrow(x1), 100)

    ## Multivariate Hierarchical Climate Regionalization
    y <- HiClimR(x = list(x1, x2), lon = lon, lat = lat, lonStep = 1, latStep = 1,
        geogMask = FALSE, continent = "Africa", meanThresh = list(10, NULL),
        varThresh = list(0, 0), detrend = list(TRUE, FALSE), standardize = list(TRUE, TRUE),
        weightMVC = list(1, 1), nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
        members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
        validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
        plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)

    ## You can apply all clustering methods and options

Miscellaneous Examples

    library(HiClimR)

    #----------------------------------------------------------------------------------#
    # Miscellaneous examples to provide more information about functionality and usage #
    # of the helper functions that can be used separately or for other applications.    #
    #----------------------------------------------------------------------------------#

    ## Load test case data
    x <- TestCase$x

    ## Generate longitude and latitude mesh vectors
    xGrid <- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
    lon <- c(xGrid$lon)
    lat <- c(xGrid$lat)

    ## Coarsening spatial resolution
    xc <- coarseR(x = x, lon = lon, lat = lat, lonStep = 2, latStep = 2)
    lon <- xc$lon
    lat <- xc$lat
    x <- xc$x

    ## Use fastCor function to compute the correlation matrix
    t0 <- proc.time(); xcor <- fastCor(t(x)); proc.time() - t0
    ## Compare with cor function
    t0 <- proc.time(); xcor0 <- cor(t(x)); proc.time() - t0

    ## Check the valid options for geographic masking
    geogMask()

    ## Geographic mask for Africa
    gMask <- geogMask(continent = "Africa", lon = lon, lat = lat, plot = TRUE,
        colPalette = NULL)

    ## Hierarchical Climate Regionalization without geographic masking
    y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
        continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
        standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
        members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
        validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
        plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)

    ## With geographic masking (you may specify the mask produced above to save time)
    y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = TRUE,
        continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
        standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
        members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
        validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
        plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)

    ## With geographic masking and contiguity constraint
    ## Change contigConst as appropriate
    y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = TRUE,
        continent = "Africa", contigConst = 1, meanThresh = 10, varThresh = 0, detrend = TRUE,
        standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
        members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
        validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
        plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)

    ## Find minimum significant correlation at 95% confidence level
    rMin <- minSigCor(n = nrow(x), alpha = 0.05, r = seq(0, 1, by = 1e-06))

    ## Validation of Hierarchical Climate Regionalization
    z <- validClimR(y, k = 12, minSize = 1, alpha = 0.01, plot = TRUE, colPalette = NULL)

    ## Apply minimum cluster size (minSize = 25)
    z <- validClimR(y, k = 12, minSize = 25, alpha = 0.01, plot = TRUE, colPalette = NULL)

    ## The optimal number of clusters, including small clusters
    k <- length(z$clustFlag)

    ## The selected number of clusters, after excluding small clusters (if minSize > 1)
    ks <- sum(z$clustFlag)

    ## Dendrogram plot
    plot(y, hang = -1, labels = FALSE)

    ## Tree cut
    cutTree <- cutree(y, k = k)
    table(cutTree)

    ## Visualization for gridded data
    RegionsMap <- matrix(y$region, nrow = length(unique(y$coords[, 1])), byrow = TRUE)
    colPalette <- colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan",
        "#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))
    image(unique(y$coords[, 1]), unique(y$coords[, 2]), RegionsMap, col = colPalette(ks))

    ## Visualization for gridded or ungridded data
    ## Change pch and cex as appropriate!
    plot(y$coords[, 1], y$coords[, 2],
        col = colPalette(max(y$region, na.rm = TRUE))[y$region], pch = 15, cex = 1)

    ## Export region map and mean timeseries into NetCDF-4 file
    library(ncdf4)
    y.nc <- HiClimR2nc(y = y, ncfile = "HiClimR.nc", timeunit = "years", dataunit = "mm")
    ## The NetCDF-4 file is still open to add other variables or close it
    nc_close(y.nc)
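
The idea of a minimum significant correlation, used above through minSigCor and in the objective tree cut, can be illustrated with a closed-form base R sketch. It assumes a two-sided t-test on the Pearson correlation of n samples; minSigCorT is a hypothetical helper and is not claimed to reproduce how the package's minSigCor is computed.

```r
# Hypothetical helper (not the package's minSigCor): smallest correlation
# that a two-sided t-test declares non-zero at level alpha for n samples.
minSigCorT <- function(n, alpha = 0.05) {
  tCrit <- qt(1 - alpha / 2, df = n - 2)
  tCrit / sqrt(n - 2 + tCrit^2)
}

n <- 30
rMinT <- minSigCorT(n = n, alpha = 0.05)

# Check: the test statistic at rMinT has a two-sided p-value equal to alpha
tStat <- rMinT * sqrt((n - 2) / (1 - rMinT^2))
pVal <- 2 * pt(tStat, df = n - 2, lower.tail = FALSE)
all.equal(pVal, 0.05)

# A stricter significance level requires a larger correlation
minSigCorT(n = n, alpha = 0.01) > rMinT
```

Under this assumption, two region means whose correlation exceeds the threshold are not statistically distinct at the chosen level, which is the intuition behind cutting the tree where inter-regional correlations stop being significant.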