Multivariate data



Juan C. Rocha

Multivariate data

“simultaneous observation and analysis of more than one outcome variable”

   Achimill Agrostol Airaprae Alopgeni Anthodor Bellpere Bromhord Chenalbu
1         1        0        0        0        0        0        0        0
2         3        0        0        2        0        3        4        0
3         0        4        0        7        0        2        0        0
4         0        8        0        2        0        2        3        0
5         2        0        0        0        4        2        2        0
6         2        0        0        0        3        0        0        0
7         2        0        0        0        2        0        2        0
8         0        4        0        5        0        0        0        0
9         0        3        0        3        0        0        0        0
10        4        0        0        0        4        2        4        0
11        0        0        0        0        0        0        0        0
12        0        4        0        8        0        0        0        0
13        0        5        0        5        0        0        0        1
14        0        4        0        0        0        0        0        0
15        0        4        0        0        0        0        0        0
16        0        7        0        4        0        0        0        0
17        2        0        2        0        4        0        0        0
18        0        0        0        0        0        2        0        0
19        0        0        3        0        4        0        0        0
20        0        5        0        0        0        0        0        0
   Cirsarve Comapalu Eleopalu Elymrepe Empenigr Hyporadi Juncarti Juncbufo
1         0        0        0        4        0        0        0        0
2         0        0        0        4        0        0        0        0
3         0        0        0        4        0        0        0        0
4         2        0        0        4        0        0        0        0
5         0        0        0        4        0        0        0        0
6         0        0        0        0        0        0        0        0
7         0        0        0        0        0        0        0        2
8         0        0        4        0        0        0        4        0
9         0        0        0        6        0        0        4        4
10        0        0        0        0        0        0        0        0
11        0        0        0        0        0        2        0        0
12        0        0        0        0        0        0        0        4
13        0        0        0        0        0        0        0        3
14        0        2        4        0        0        0        0        0
15        0        2        5        0        0        0        3        0
16        0        0        8        0        0        0        3        0
17        0        0        0        0        0        2        0        0
18        0        0        0        0        0        0        0        0
19        0        0        0        0        2        5        0        0
20        0        0        4        0        0        0        4        0
   Lolipere Planlanc Poaprat Poatriv Ranuflam Rumeacet Sagiproc Salirepe
1         7        0       4       2        0        0        0        0
2         5        0       4       7        0        0        0        0
3         6        0       5       6        0        0        0        0
4         5        0       4       5        0        0        5        0
5         2        5       2       6        0        5        0        0
6         6        5       3       4        0        6        0        0
7         6        5       4       5        0        3        0        0
8         4        0       4       4        2        0        2        0
9         2        0       4       5        0        2        2        0
10        6        3       4       4        0        0        0        0
11        7        3       4       0        0        0        2        0
12        0        0       0       4        0        2        4        0
13        0        0       2       9        2        0        2        0
14        0        0       0       0        2        0        0        0
15        0        0       0       0        2        0        0        0
16        0        0       0       2        2        0        0        0
17        0        2       1       0        0        0        0        0
18        2        3       3       0        0        0        0        3
19        0        0       0       0        0        0        3        3
20        0        0       0       0        4        0        0        5
   Scorautu Trifprat Trifrepe Vicilath Bracruta Callcusp
1         0        0        0        0        0        0
2         5        0        5        0        0        0
3         2        0        2        0        2        0
4         2        0        1        0        2        0
5         3        2        2        0        2        0
6         3        5        5        0        6        0
7         3        2        2        0        2        0
8         3        0        2        0        2        0
9         2        0        3        0        2        0
10        3        0        6        1        2        0
11        5        0        3        2        4        0
12        2        0        3        0        4        0
13        2        0        2        0        0        0
14        2        0        6        0        0        4
15        2        0        1        0        4        0
16        0        0        0        0        4        3
17        2        0        0        0        0        0
18        5        0        2        1        6        0
19        6        0        2        0        3        0
20        2        0        0        0        4        3

The dune dataset from the vegan package records dune meadow vegetation: 30 species at 20 sites.

Social-ecological systems (SES) data is multidimensional

dune.env has environmental data for the 20 sites

     A1 Moisture Management      Use Manure
1   2.8        1         SF Haypastu      4
2   3.5        1         BF Haypastu      2
3   4.3        2         SF Haypastu      4
4   4.2        2         SF Haypastu      4
5   6.3        1         HF Hayfield      2
6   4.3        1         HF Haypastu      2
7   2.8        1         HF  Pasture      3
8   4.2        5         HF  Pasture      3
9   3.7        4         HF Hayfield      1
10  3.3        2         BF Hayfield      1
11  3.5        1         BF  Pasture      1
12  5.8        4         SF Haypastu      2
13  6.0        5         SF Haypastu      3
14  9.3        5         NM  Pasture      0
15 11.5        5         NM Haypastu      0
16  5.7        5         SF  Pasture      3
17  4.0        2         NM Hayfield      0
18  4.6        1         NM Hayfield      0
19  3.7        5         NM Hayfield      0
20  3.5        5         NM Hayfield      0

Data types

Type | Features | Notation | Category | Distribution
Count data (N) | Discrete (countable) | Nominal data | Qualitative | Non-parametric
Count with 2 groups (dichotomous) | Discrete categorical, non-rankable | Nominal data | Qualitative | Non-parametric
Count with > 2 groups | Discrete categorical, non-rankable | Nominal data | Qualitative | Non-parametric
Count with rankable groups | Discrete categorical, rankable | Ordinal data | Qualitative | Non-parametric
Measurements (but not true zero) | Continuous | Interval data | Quantitative | Parametric
Relative measurements (true zero) | Relative continuous | Interval ratio data | Quantitative | Parametric
Time series | Continuous data linked by a time vector | Interval data along a time series | Quantitative | Parametric

From Fetzer et al. (2022)

Multivariate analysis

  1. Multivariate analysis of variance (MANOVA)
  2. Clustering & ordination
  3. Principal component analysis (PCA)
  4. Non-metric multidimensional scaling (nMDS)
  5. Redundancy analysis (RDA)
  6. Canonical correspondence analysis (CCA)
  7. Factor Analysis (FA)
  8. Multiple correspondence analysis (MCA)

Clustering

“grouping sets of objects that are more similar to each other than objects in other groups (or clusters)”

  • hierarchical
  • k-means
  • diana: divisive hierarchical
  • pam: partitioning around medoids
  • clara: sampling-based pam
  • fanny: fuzzy clustering with partial memberships
  • som: self-organizing maps
  • model-based: mixture of Gaussian distributions to fit data (EM)
  • sota: self-organizing tree algorithm

Example: k-means

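library(vegan); data(dune)   # the dune data ships with vegan
out <- kmeans(as.matrix(dune), centers = 5)   # random starts: labels can differ between runs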

out
K-means clustering with 5 clusters of sizes 4, 4, 4, 4, 4

Cluster means:
  Achimill Agrostol Airaprae Alopgeni Anthodor Bellpere Bromhord Chenalbu
1      0.5        0     1.25     0.00     2.00     0.50     0.00     0.00
2      0.0        5     0.00     1.00     0.00     0.00     0.00     0.00
3      0.0        4     0.00     5.25     0.00     0.00     0.00     0.25
4      1.0        3     0.00     2.75     0.00     1.75     1.75     0.00
5      2.5        0     0.00     0.00     3.25     1.00     2.00     0.00
  Cirsarve Comapalu Eleopalu Elymrepe Empenigr Hyporadi Juncarti Juncbufo
1      0.0        0     0.00      0.0      0.5     2.25      0.0     0.00
2      0.0        1     5.25      0.0      0.0     0.00      2.5     0.00
3      0.0        0     1.00      1.5      0.0     0.00      2.0     2.75
4      0.5        0     0.00      4.0      0.0     0.00      0.0     0.00
5      0.0        0     0.00      1.0      0.0     0.00      0.0     0.50
  Lolipere Planlanc Poaprat Poatriv Ranuflam Rumeacet Sagiproc Salirepe
1     2.25      2.0    2.00    0.00      0.0      0.0     1.25     1.50
2     0.00      0.0    0.00    0.50      2.5      0.0     0.00     1.25
3     1.50      0.0    2.50    5.50      1.0      1.0     2.50     0.00
4     5.75      0.0    4.25    5.00      0.0      0.0     1.25     0.00
5     5.00      4.5    3.25    4.75      0.0      3.5     0.00     0.00
  Scorautu Trifprat Trifrepe Vicilath Bracruta Callcusp
1     4.50     0.00     1.75     0.75     3.25      0.0
2     1.50     0.00     1.75     0.00     3.00      2.5
3     2.25     0.00     2.50     0.00     2.00      0.0
4     2.25     0.00     2.00     0.00     1.00      0.0
5     3.00     2.25     3.75     0.25     3.00      0.0

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
 4  4  4  4  5  5  5  3  3  5  1  3  3  2  2  2  1  1  1  2 

Within cluster sum of squares by cluster:
[1] 144.25 115.25 141.00 164.25 112.50
 (between_SS / total_SS =  57.6 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
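
The components listed above can be pulled out of the fitted object directly. A minimal sketch, assuming a toy subset of dune (6 sites, 4 species, values copied from the table at the start) so that it runs in base R without vegan; the seed, the choice of 2 centers and nstart = 25 are illustrative assumptions, not settings used in the slides.

```r
# Toy subset of dune: 6 sites x 4 species, values copied from the table above
toy <- matrix(
  c(0, 0, 7, 2,
    0, 0, 5, 7,
    0, 0, 6, 4,
    4, 4, 0, 0,
    4, 5, 0, 0,
    5, 4, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("1", "2", "10", "14", "15", "20"),
                  c("Agrostol", "Eleopalu", "Lolipere", "Poatriv"))
)

set.seed(42)                                  # arbitrary seed, for reproducibility only
fit <- kmeans(toy, centers = 2, nstart = 25)  # 25 random starts, best solution kept

fit$cluster                 # cluster label for each site
fit$size                    # number of sites in each cluster
fit$centers                 # mean abundance of each species per cluster
fit$betweenss / fit$totss   # between-cluster share of the total sum of squares
```

Cluster labels are arbitrary, so compare memberships rather than label numbers across runs.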

Example: hierarchical

out <- hclust(dist(dune), method = "average")  # dist() defaults to Euclidean distance
plot(out)  # draw the dendrogram
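
A dendrogram does not assign groups by itself; it has to be cut. A minimal sketch, assuming the same kind of toy subset of dune (values copied from the table at the start) and an illustrative choice of k = 2; cutree() is the base R function that turns the tree into group labels.

```r
# Toy subset of dune: 6 sites x 4 species, values copied from the table above
toy <- matrix(
  c(0, 0, 7, 2,
    0, 0, 5, 7,
    0, 0, 6, 4,
    4, 4, 0, 0,
    4, 5, 0, 0,
    5, 4, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("1", "2", "10", "14", "15", "20"),
                  c("Agrostol", "Eleopalu", "Lolipere", "Poatriv"))
)

d <- dist(toy)                          # Euclidean distances between sites
tree <- hclust(d, method = "average")   # average linkage, as in the example above
groups <- cutree(tree, k = 2)           # cut the tree into 2 groups (k is an assumption)

groups          # group label for each site
table(groups)   # group sizes
```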

Important considerations

  1. Jargon minefield: e.g. dissimilarity
  2. Any algorithm is sensitive to the distance metric used.
    • Presence / absence data: Jaccard, Sorensen
    • Gower, Bray–Curtis, Jaccard and Kulczynski indices are good at detecting underlying ecological gradients
    • Morisita, Horn–Morisita, Binomial, Cao and Chao indices should be able to handle different sample sizes
    • Mountford and Raup-Crick indices for presence–absence data should be able to handle unknown (and variable) sample sizes
    • For guidance see help("vegdist") and the vignettes of the vegan package
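
To see why the metric matters, the sketch below computes Bray–Curtis by hand and Jaccard on presence/absence for a toy subset of dune (values copied from the table at the start). The helper bray() is hypothetical and written only for illustration; in practice vegan's vegdist() (for example with method = "bray") covers these indices.

```r
# Toy subset of dune: 6 sites x 4 species, values copied from the table above
toy <- matrix(
  c(0, 0, 7, 2,
    0, 0, 5, 7,
    0, 0, 6, 4,
    4, 4, 0, 0,
    4, 5, 0, 0,
    5, 4, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("1", "2", "10", "14", "15", "20"),
                  c("Agrostol", "Eleopalu", "Lolipere", "Poatriv"))
)

# Hypothetical helper: Bray-Curtis dissimilarity between two abundance vectors,
# sum of absolute differences divided by the sum of all abundances
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)

bc_similar   <- bray(toy["1", ], toy["2", ])    # same species, different abundances
bc_different <- bray(toy["1", ], toy["14", ])   # no species in common

# Jaccard dissimilarity on presence/absence; base R calls this method "binary"
jac <- as.matrix(dist((toy > 0) * 1, method = "binary"))

# Euclidean distance on the raw abundances, for comparison
euc <- as.matrix(dist(toy))

c(bray = bc_similar, jaccard = jac["1", "2"], euclidean = euc["1", "2"])
```

Sites 1 and 2 share exactly the same species, so their Jaccard dissimilarity is zero, while Bray–Curtis and Euclidean distance still register the difference in abundances.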

  3. Clustering is sensitive to the raw units and distributions of the data; data transformations can help.
    • Log-transform heavy-tailed distributions (e.g. income)
    • Zero-inflated distributions
    • Re-scale to zero mean, unit variance

Decisions are made case by case, but the idea is to reduce the influence of outliers or non-informative observations.
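
Two of the transformations above, sketched in base R on a toy subset of dune (values copied from the table at the start). Using log1p() rather than log() is an assumption made here because the data contain zeros; whether either transformation is appropriate depends on the case.

```r
# Toy subset of dune: 6 sites x 4 species, values copied from the table above
toy <- matrix(
  c(0, 0, 7, 2,
    0, 0, 5, 7,
    0, 0, 6, 4,
    4, 4, 0, 0,
    4, 5, 0, 0,
    5, 4, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("1", "2", "10", "14", "15", "20"),
                  c("Agrostol", "Eleopalu", "Lolipere", "Poatriv"))
)

toy_log <- log1p(toy)   # log(1 + x): compresses large values, keeps zeros at zero
toy_std <- scale(toy)   # per column: subtract the mean, divide by the standard deviation

round(colMeans(toy_std), 10)   # column means after scaling
apply(toy_std, 2, sd)          # column standard deviations after scaling

# Distances computed on the transformed data feed into clustering as before
d_std <- dist(toy_std)
```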

Sensitivity and validation

How many clusters to fit?

  • The NbClust package tests 10 different methods
  • It compares 30 different metrics of performance
library(NbClust)
library(clValid)
help("NbClust")
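
As a base R stand-in for the idea (this is not NbClust itself), the sketch below fits k-means for several values of k on a toy subset of dune (values copied from the table at the start) and compares the total within-cluster sum of squares, the so-called elbow heuristic. The range of k, the seed and nstart are illustrative assumptions; NbClust automates this kind of comparison across many indices.

```r
# Toy subset of dune: 6 sites x 4 species, values copied from the table above
toy <- matrix(
  c(0, 0, 7, 2,
    0, 0, 5, 7,
    0, 0, 6, 4,
    4, 4, 0, 0,
    4, 5, 0, 0,
    5, 4, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("1", "2", "10", "14", "15", "20"),
                  c("Agrostol", "Eleopalu", "Lolipere", "Poatriv"))
)

set.seed(1)   # arbitrary seed, for reproducibility only
ks <- 1:5     # candidate numbers of clusters (must stay below the number of sites)

# Total within-cluster sum of squares for each k
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

data.frame(k = ks, wss = round(wss, 2))

# Look for the k after which adding clusters stops paying off
plot(ks, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

A single heuristic like this is only one vote; the point of the packages above is to compare many such criteria before settling on a number.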

Real world example

Mapping social-ecological systems archetypes

Rocha et al., 2020. ERL

Ostrom’s heritage

  • Challenge of the SDGs: how do we find context-dependent solutions?
  • There are no panaceas!
  • Social-ecological systems framework
    • 2-tier variables (n=53)
  • Over 100 case studies coded, but local in temporal and spatial scale

Aim: to develop a data-driven method to upscale Ostrom’s SES framework

Volta river basin

  • The West African Sahel is a vulnerable area due to:
    • widespread poverty
    • recurrent droughts and dry spells
    • political upheaval
    • growing food demand
  • 2/3 of the Volta basin lies in Ghana and Burkina Faso
    • N: dry, poor, subsistence agriculture
    • S: wet, rich, urbanization

Ostrom’s framework

1st tier | 2nd tier | Indicators
Socio-economic and political settings (S) | S2-Demographic trends | Population trend
 | | Inter-regional migration
 | | Intra-regional migration
 | S5-Market incentives | Market access
Resource System (RS) | RS4-Human constructed facilities | Dams
 | RS7-Predictability of system’s dynamics | Variance of production (kcals)
Resource Units (RU) | RU5-Number of units | Cattle per km²
 | | Small ruminants per capita
Users (U) | U1-Number of users | Population density (persons/km²)
 | | Ratio of farmers (%)
 | U2-Socioeconomic attributes | Ratio of children (% < 14yr)
 | | Ratio of women (%)
 | | Literacy (%)
Related Ecosystems (ECO) | ECO1-Climate patterns | Aridity
 | | Mean temperature (C)
 | ECO3-Flows | Soil water
 | | Wet season (months precip. > 60mm)
 | | Slope 75%
Interactions (I) | I1-Harvesting levels | Kilocalories for diverse crops

How many clusters, which algorithm?

Lessons

Learn more: Rocha et al., 2020. Mapping Social-Ecological System Archetypes. ERL

Further reading

Guy Brock, Vasyl Pihur, Susmita Datta, Somnath Datta (2008). clValid: An R Package for Cluster Validation. Journal of Statistical Software, 25(4), 1-22. URL https://www.jstatsoft.org/v25/i04/

Malika Charrad, Nadia Ghazzali, Veronique Boiteau, Azam Niknafs (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36. URL http://www.jstatsoft.org/v61/i06/.