Geometrical Archetypal Analysis: One Step Beyond the Current View

Geometrical Archetypal Analysis (GAA)

At the traditional Archetypal Analysis (AA) [1] we have a data frame df which is a matrix Y with dimension n × d, with n the number of observations (rows) and d the number of variables (columns) or the dimension of the relevant Linear Space Rd.

The output of our AA algorithm which has been implemented at the R Package “archetypal” computes the matrices A, B such that: Y∼ ABY or if we want to be more strict our kappas× d matrix of archetypes BY is such that next squared Frobenius norm is minimum SSE = ∥Y − ABY2 = minimum A (n× kappas) is a are row stochastic matrix and B (kappas× n ) is a column stochastic one.

We also define the Variance Explained as next: $$ varexpl=\frac{SST-SSE}{SST} $$ with SST being the total sum of squares for all elements of matrix Y.

It is a suitable modification of PCHA algorithm, see [2], [3], which now uses the data frames without transposing them and it has full control to all external and internal parameters of it.

But there are many disadvantages at this route:

  • The number of archetypes (“kappas”) is unknown
  • The choice of kappas by using the SSE plot is a very questionable process:
    • The elbow or knee point usually is located at the center of the SSE-kappas curve
    • If you test more kappas, then it increases

But what really stands for the concept of archetypes?

We know that archetypes lie on the outer bound of the Convex Hull that covers all our data points.

Now arise next questions:

  • How many points in the vector space Rd are needed in order to create a minimal Convex Hull for them which can be called “Archetypal Spanning Convex Hull”?
    • The answer is kappas = d + 1
  • If we take the above as the number of archetypes then:
    • What is the percentage of our data points that are covered by that hull?
    • Are we satisfied with a value of 1%?
    • For social sciences data sets we observe an almost ellipsoidal distribution
    • That means we need many archetypes for an acceptable data coverage
    • What should we do in order to increase that data coverage?

The solution to all the above problems is Geometrical Archetypal Analysis (GAA), see also the pdf version [4].

Given the data frame df and its correspondent matrix Y ∈ Rd we follow next steps:

  1. We find the minimum Yj(1) and maximum Yj(ν) values for all variables Yj, j = 1, ..., d
  2. We take the Cartesian Product: {Y1(1), Y1(ν)} × {Y2(1), Y2(ν)} × … × {Yd(1), Yd(ν)}
  3. We define the 2d vectors as the Archetypal Grid, for example if d = 3 we find the 8 rows of next matrix: $$ \begin{matrix}Y1,min&Y2,min&Y3,min \\Y1,max&Y2,min&Y3,min \\Y1,min&Y2,max&Y3,min \\Y1,max&Y2,max&Y3,min \\Y1,min&Y2,min&Y3,max \\Y1,max&Y2,min&Y3,max \\Y1,min&Y2,max&Y3,max \\Y1,max&Y2,max&Y3,max \\\end{matrix} $$
  4. We create a new data frame dg with the Archetypal Grid as the first rows and the initial data frame “df” as the rest ones
  5. We fix the BY matrix of archetypes to be the first 2d rows of dg
  6. We iteratively compute the A-matrix of the Y∼ ABY decomposition
  7. We collect the {2d + 1, … n} rows of that A-matrix: these are the compositions of all initial data rows as convex combinations of Grid Archetypes

Since we are sure that our archetypes cover the 100% of the data points, then we don’t care about their number so much, since what actually matters is the compositions for each point.

Grid Archetypal

We create a set of random 3D data points and then we compute the Grid Archetypal Analysis.

library(GeomArchetypal)
# Create random data
set.seed(20140519)
df=matrix(runif(300) , nrow = 100, ncol=3) 
colnames(df)=c("x","y","z")
# Grid Archetypal
gaa=grid_archetypal(df, diag_less = 1e-6, niter = 50, verbose = FALSE)
summary(gaa)
## Number of observations / data rows n = 108 
## (The first 8 rows are the Grid Archetypes) 
## Dimension of data variables d = 3 
## Number of Grid Archetypes k = 8 
## 
## Grid Archetypes: 
## 
##            x          y          z
## 1 0.04107952 0.00131741 0.01868682
## 2 0.98128784 0.00131741 0.01868682
## 3 0.04107952 0.99690733 0.01868682
## 4 0.98128784 0.99690733 0.01868682
## 5 0.04107952 0.00131741 0.99566805
## 6 0.98128784 0.00131741 0.99566805
## 7 0.04107952 0.99690733 0.99566805
## 8 0.98128784 0.99690733 0.99566805
## 
## Run details: 
## 
##            SSE VarianceExplained Convergence Iterations ElapsedTime
## 1 5.951504e-06         0.9999999        TRUE         39        0.01

If we plot the data points and the archetypes we observe that the Archetypal Spanning Convex Hull covers all our points, a fact that can be checked by using the function points_inside_convex_hull():

plot(gaa)

points_inside_convex_hull(df, gaa$grid)
## [1] 100

Closer Grid Archetypal

For the same data set as before we want to set as archetypes the Closer to the Grid Archetypes just because we’d like our archetypes to belong to the data set.

# Closer Grid Archetypal
cga=closer_grid_archetypal(df,diag_less = 1e-3, niter = 70, verbose = FALSE)
summary(cga)
## Number of observations / data rows n = 100 
## Dimension of data variables d = 3 
## Number of Closer Grid Archetypes k = 8 
## The rows that formed the Closer Grid Archetypes are: 
## 52,69,18,87,6,14,39,5 
## 
## The Closer Grid Archetypes are: 
## 
##            x          y          z
## 1 0.13352227 0.15353714 0.07159086
## 2 0.75890322 0.09126450 0.01868682
## 3 0.07234262 0.79526378 0.06748076
## 4 0.88561538 0.94945905 0.10888000
## 5 0.31213541 0.12031532 0.87912934
## 6 0.96420042 0.07130253 0.77504836
## 7 0.24115414 0.88179278 0.90609350
## 8 0.89296529 0.95993797 0.94319426
## 
## The Run details are: 
## 
##         SSE VarianceExplained Convergence Iterations ElapsedTime
## 1 0.3367711         0.9965373        TRUE         26           0
plot(cga)

What is the percentage of data coverage for the Closer Grid Archetypes?

pc=points_inside_convex_hull(df,cga$aa$BY)
print(pc)
## [1] 59

The is a relatively high percentage compared to the usually very low values of less than 10%, especially if we seek for a small number of archetypes.

Fast Archetypal

If we decide that we’ll take a specific set of rows as archetypes, then we can use the function fast-archetypal() to compute the A-matrix of convex coefficients for the rest of data points.

It is the generic function that is used from both grid_archetypal() and closer_grid_archetypal() for their main computations.

Here we can use the rows we have found just before as the rows for the archetypes.

# Fast Archetypal
fa=fast_archetypal(df, irows = cga$grid_rows, diag_less = 1e-6,
                   niter = 100, verbose = FALSE)
summary(fa)
## Number of observations / data rows n = 100 
## Dimension of data variables d = 3 
## Number of archetypes k = 8 
## 
## Archetypes: 
## 
##            x          y          z
## 1 0.13352227 0.15353714 0.07159086
## 2 0.75890322 0.09126450 0.01868682
## 3 0.07234262 0.79526378 0.06748076
## 4 0.88561538 0.94945905 0.10888000
## 5 0.31213541 0.12031532 0.87912934
## 6 0.96420042 0.07130253 0.77504836
## 7 0.24115414 0.88179278 0.90609350
## 8 0.89296529 0.95993797 0.94319426
## 
## Run details: 
## 
##         SSE VarianceExplained Convergence Iterations ElapsedTime
## 1 0.3305205         0.9966016        TRUE         57           0
## 
## Call:
## fast_archetypal(df = df, irows = cga$grid_rows, diag_less = 1e-06, 
##     niter = 100, verbose = FALSE)
## NULL
plot(fa)

par(oldpar)
options(oldoptions)

As it was expected the results are totally identical with the previously found ones.

References

[1] Cutler, A., & Breiman, L. (1994). Archetypal Analysis. Technometrics, 36(4), 338–347. https://doi.org/10.1080/00401706.1994.10485840

[2] M Morup and LK Hansen, “Archetypal analysis for machine learning and data mining”, Neurocomputing (Elsevier, 2012). https://doi.org/10.1016/j.neucom.2011.06.033.

[3] Source: https://mortenmorup.dk/?page_id=2 , last accessed 2024-03-09

[4] Christopoulos, DT, (2024) “Geometrical Archetypal Analysis: One Step Beyond the Current View”, http://dx.doi.org/10.13140/RG.2.2.14030.88642 .