Animated Linear Projections

 
Nicholas Spyrison
Department of Human-Centered Computing; Department of Econometrics and Business Statistics; Monash University, Australia

Nick’s research focus is on multivariate data visualization, dimension reduction, and nonlinear model interpretability. He is also the author of the R packages ‘spinifex’ and ‘cheem’.

Website: https://nspyrison.netlify.com
Github: https://github.com/nspyrison
Linkedin: https://www.linkedin.com/in/nspyrison/

Introduction

Suppose you are at the doctor’s office for a general check-up. How many variables do you think are recorded? Certainly age, height, weight, blood pressure, and heart rate, and perhaps blood oxygenation or iron levels if a blood test is taken. Visualizing quantitative multivariate data quickly becomes difficult as the dimensionality, or number of measures, increases. Yet visualization conveys much more information than numerical summaries allow and is vital in every field of work. First, let us consider how we can visualize a few dimensions.

Traditional multivariate visualization

Consider how we visualize data. For single variables, histograms and density plots show the distribution. Scatterplots relate information from 2 or 3 measures simultaneously. Beyond that, we have to change approach. Taking an exhaustive approach for an arbitrary number of variables, p, we could draw p histograms or densities. Similarly, we could look at scatterplots of all pairs or triples of variables. This is the crux of scatterplot matrices (Chambers et al., 1983). As the number of dimensions increases, the number of panels grows rapidly: quadratically for pairs and cubically for triples. This approach has another limitation: no single panel can show information from more than 2 or 3 dimensions at a time.
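To make the growth in panels concrete, the following minimal sketch counts the univariate, pairwise, and three-way views for a given number of variables p. The function name `panel_counts` is illustrative and does not come from any package.

```python
from math import comb

def panel_counts(p):
    """Count the univariate, pairwise, and three-way views of p variables."""
    return {"histograms": p, "pairs": comb(p, 2), "triples": comb(p, 3)}

# Print the counts for a few illustrative values of p.
for p in (5, 10, 20):
    print(p, panel_counts(p))
```

With only 10 variables there are already 45 distinct pairs to inspect, and each panel still shows only two variables at a time.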

Another method is to place the variables side by side as parallel axes and connect each observation’s values with a line. After normalizing or standardizing each variable to a common scale, the result is a parallel coordinate plot (Ocagne, 1885). This display scales linearly with the number of dimensions but has a few downsides. The amount of ink each observation uses is relatively high, it is hard to extract information such as correlation, and the interpretation changes with the order of the variables.
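The rescaling step can be sketched as follows. This is one possible choice (min-max scaling to the interval from 0 to 1) and is an assumption here; standardizing to z-scores would serve equally well. The data values and the name `rescale_columns` are made up for illustration.

```python
def rescale_columns(rows):
    """Min-max rescale each column of a list of rows to the range [0, 1]."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column has no spread; place it at the midpoint.
        scaled_cols.append([0.5 if span == 0 else (v - lo) / span for v in col])
    # Transpose back so each row is again one observation.
    return [list(row) for row in zip(*scaled_cols)]

# Made-up observations with three variables on very different scales.
rows = [[39.1, 3750, 181], [46.5, 5000, 210], [50.0, 3800, 195]]
lines = rescale_columns(rows)
```

Each row of `lines` gives the heights at which one observation’s line crosses the parallel axes.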

Projections

As the dimensionality of the data increases, the analyst will want to turn to dimension reduction in the form of projections. A projection is a function that maps the data from a larger space to a smaller space. Projections fall into two categories, linear and nonlinear. The linear case spans all affine transformations, essentially any function under which parallel lines stay parallel. Nonlinear transformations are the complement of the linear case; think of transformations containing exponents or interaction terms.

Intuitively, linear projections can be thought of as 2D shadows of 3D objects. Holding the light and background constant, the shape of the shadow is determined by the object and its orientation. The orientation and map of the data are defined by a matrix called the basis, $A$, where $Y_{n \times d} = X_{n \times p} A_{p \times d}$. As a side note, we restrict these bases to be orthonormal (columns at right angles to each other and of length 1). One common linear projection is principal component analysis, which provides a reoriented basis ordered by decreasing variation, found with an eigenvalue decomposition (Pearson, 1901). Figure 1 shows two linear projections of the penguins data (Gorman et al., 2014). The left frame shows the separation of the clusters clearly, while the right frame is much less informative.


Figure 1: Two basis orientations of penguins data. Some orientations are more informative than others.
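A linear projection of the kind shown in Figure 1 amounts to multiplying the data by an orthonormal basis. The sketch below is a minimal illustration using only the Python standard library. The data values, the starting matrix, and the helper names `matmul` and `orthonormalize` are assumptions made for this example, and Gram-Schmidt is just one way to obtain an orthonormal basis.

```python
import math

def matmul(X, A):
    """Multiply an n x p matrix by a p x d matrix, both given as lists of rows."""
    return [[sum(x * a for x, a in zip(row, col)) for col in zip(*A)] for row in X]

def orthonormalize(A):
    """Gram-Schmidt on the columns of a p x d matrix; returns a p x d matrix."""
    ortho_cols = []
    for col in zip(*A):
        v = list(col)
        # Remove the components along the columns already processed.
        for u in ortho_cols:
            dot = sum(vi * ui for vi, ui in zip(v, u))
            v = [vi - dot * ui for vi, ui in zip(v, u)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm < 1e-12:
            raise ValueError("columns are linearly dependent")
        ortho_cols.append([vi / norm for vi in v])
    return [list(row) for row in zip(*ortho_cols)]

# Made-up data: n = 4 observations of p = 3 variables.
X = [[1.0, 2.0, 0.5], [0.0, 1.0, 1.5], [2.0, 0.0, 1.0], [1.5, 1.5, 0.0]]
# An arbitrary starting matrix, orthonormalized into a p x d basis with d = 2.
A = orthonormalize([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
Y = matmul(X, A)  # n x d projected coordinates, ready for a scatterplot
```

Changing `A` changes the orientation from which the data are viewed, which is why some bases separate clusters and others hide them.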

Nonlinear projections, which can bend and distort spaces, are not entirely accurate or faithful to the original variable space. Various quality metrics, such as Trustworthiness, Continuity, Normalized stress, and Average local error, have been introduced to describe the distortion of the space (Espadoto et al., 2021). Unfortunately, these metrics are hard to visualize and communicate, making the distortions opaque to the analyst. The intuition can be demonstrated with map projections. Snyder (1987) lists over 200 different projections that distort the surface of the earth to display it as a 2D map, each with unique properties and use cases. The added subjectivity of choosing a nonlinear method and its parameters, together with the difficulty of understanding how the space is distorted, makes it hard to interpret variable influences. These factors potentially obscure the signal in the data.

Traditionally, linear projections are used to approximate the data in fewer dimensions than the original. This is well and good, but it can still leave a gap in visualizing the data. That is, data approximated by the first d principal components must be visualized across all d components rather than just the first 2 or 3, as the later components still regularly hold signal, such as in clustered data (Donnell et al., 1994) or high-throughput genomics data.

Animated linear projections

Going back to the shadow analogy, the analyst can discern information about the shape of the unknown data object by observing the shadow as the object is rotated. In the case of a bar stool, some profiles are relatively uninformative, such as the circle cast by the seat, which could come from many items. However, if the object continually rotates, the legs will show in the shadow, conveying a much better idea of the object’s shape. This is the intuition for tours. Tours are linear projections animated over small changes to the basis. By viewing the data over differing orientations, the analyst gleans information about the data: which orientations contain interesting features such as cluster separation or unique profile shapes. Another key aspect is the permanence of observations between nearby frames. That is to say, the analyst can see structure by watching small relative changes between consecutive frames. In contrast, the typical static case tries to relate two or more distant bases with no intermediate information conveyed.
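The idea of animating over small changes to the basis can be sketched as below. Dedicated tour software interpolates more carefully (for example, along geodesics between bases). This sketch swaps in a simpler scheme, linear blending followed by re-orthonormalization, purely for illustration. The names `interpolate_bases` and `orthonormalize`, the two bases, and the data point are all assumptions made for this example.

```python
import math

def orthonormalize(A):
    """Gram-Schmidt on the columns of a p x d matrix; returns a p x d matrix."""
    ortho_cols = []
    for col in zip(*A):
        v = list(col)
        for u in ortho_cols:
            dot = sum(vi * ui for vi, ui in zip(v, u))
            v = [vi - dot * ui for vi, ui in zip(v, u)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm < 1e-12:
            raise ValueError("columns are linearly dependent")
        ortho_cols.append([vi / norm for vi in v])
    return [list(row) for row in zip(*ortho_cols)]

def interpolate_bases(start, target, n_frames):
    """Return n_frames (at least 2) orthonormal bases stepping from start to target."""
    frames = []
    for k in range(n_frames):
        t = k / (n_frames - 1)  # runs from 0 to 1 in equal steps
        blend = [[(1 - t) * s + t * g for s, g in zip(s_row, g_row)]
                 for s_row, g_row in zip(start, target)]
        # The blend is generally not orthonormal, so repair it for every frame.
        frames.append(orthonormalize(blend))
    return frames

# Two made-up 3 x 2 bases: the first views variables 1 and 2, the second 2 and 3.
start = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
target = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = interpolate_bases(start, target, 5)

# Follow one made-up observation through the frames of the animation.
x = [1.0, 2.0, 3.0]
path = [[sum(xi * a for xi, a in zip(x, col)) for col in zip(*A)] for A in frames]
```

Because consecutive frames differ only slightly, the point in `path` moves in small steps, which is what lets the viewer track observations across the animation.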

There are various types of tours, distinguished by how their target bases are selected (Cook et al., 2008). The grand tour animates between randomly selected target bases, while a guided tour performs projection pursuit, optimizing some index function. The manual tour allows the analyst to control the contribution of a selected variable. The manual tour is demonstrated in Figure 2, which varies the contribution of one variable that explains most of the separation between the orange and green clusters. Lee et al. (2021) discuss recent advances in tours, which remain a topic of active interest. Tours can be produced with the R packages tourr (Wickham et al., 2011) and spinifex (Spyrison and Cook, 2020).


Figure 2: A manual tour varying the contribution of one variable (bill length). As this variable is removed the orange and green clusters overlap. Because of this, we say that this cluster separation is sensitive to this variable. The animated version can be viewed at https://vimeo.com/676723431.
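The effect of removing a variable’s contribution, as in Figure 2, can be imitated with a rough sketch. This is not the algorithm used by spinifex; it is a simplified stand-in that shrinks one variable’s row of the basis and then re-orthonormalizes. The helper names `vary_contribution` and `contribution`, and the starting basis, are hypothetical.

```python
import math

def orthonormalize(A):
    """Gram-Schmidt on the columns of a p x d matrix; returns a p x d matrix."""
    ortho_cols = []
    for col in zip(*A):
        v = list(col)
        for u in ortho_cols:
            dot = sum(vi * ui for vi, ui in zip(v, u))
            v = [vi - dot * ui for vi, ui in zip(v, u)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm < 1e-12:
            raise ValueError("columns are linearly dependent")
        ortho_cols.append([vi / norm for vi in v])
    return [list(row) for row in zip(*ortho_cols)]

def vary_contribution(basis, var_index, scale):
    """Scale one variable's row of the basis by `scale`, then re-orthonormalize."""
    shrunk = [[a * scale for a in row] if i == var_index else list(row)
              for i, row in enumerate(basis)]
    return orthonormalize(shrunk)

def contribution(basis, var_index):
    """Length of a variable's row: how strongly it drives the 2D projection."""
    return math.hypot(*basis[var_index])

# A made-up 3 x 2 starting basis in which every variable contributes.
basis = orthonormalize([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
# Step the first variable's contribution down to zero.
frames = [vary_contribution(basis, 0, s) for s in (1.0, 0.75, 0.5, 0.25, 0.0)]
sizes = [contribution(f, 0) for f in frames]
```

In this example `sizes` shrinks steadily to zero, mirroring how the selected variable is rotated out of the projection. If clusters merge as it shrinks, their separation is sensitive to that variable.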

In summary, visualizing quantitative multivariate data remains a difficult task as the number of dimensions increases. Projections must be used at some point. Traditionally, spaces are approximated in fewer dimensions, though these approximations are often not viewed thoroughly. Tours, animated linear projections, can reveal more information than static linear projections by showing many more orientations of the data as the basis changes. Recent articles and R packages further discuss and facilitate the production of tours.

 

References:

  1. Chambers, J., Cleveland, W., Kleiner, B., Tukey, P., 1983. Graphical Methods for Data Analysis.
  2. Cook, D., Buja, A., Lee, E.K., Wickham, H., 2008. Grand Tours, Projection Pursuit Guided Tours, and Manual Controls, in: Handbook of Data Visualization. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 295–314. URL: http://link.springer.com/10.1007/978-3-540-33037-0_13, doi:10.1007/978-3-540-33037-0_13.
  3. Donnell, D.J., Buja, A., Stuetzle, W., 1994. Analysis of Additive Dependencies and Concurvities Using Smallest Additive Principal Components. The Annals of Statistics 22, 1635–1668. URL: https://doi.org/10.1214/aos/1176325746.
  4. Espadoto, M., Martins, R.M., Kerren, A., Hirata, N.S.T., Telea, A.C., 2021. Toward a Quantitative Survey of Dimension Reduction Techniques. IEEE Transactions on Visualization and Computer Graphics 27, 2153–2173. doi:10.1109/TVCG.2019.2944182.
  5. Gorman, K.B., Williams, T.D., Fraser, W.R., 2014. Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PloS one 9, e90081.
  6. Lee, S., Cook, D., da Silva, N., Laa, U., Spyrison, N., Wang, E., Zhang, H.S., 2021. The state-of-the-art on tours for dynamic visualization of high-dimensional data. WIREs Computational Statistics, e1573. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/wics.1573, doi:10.1002/wics.1573.
  7. Ocagne, M.d., 1885. Coordonnées parallèles et axiales. Méthode de transformation géométrique et procédé nouveau de calcul graphique déduits de la considération des coordonnées parallèles. Gauthier-Villars, Paris.
  8. Pearson, K., 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559–572.
  9. Snyder, J.P., 1987. Map projections: A working manual. Volume 1395. US Government Printing Office.
  10. Spyrison, N., Cook, D., 2020. spinifex: an R Package for Creating a Manual Tour of Low-dimensional Projections of Multivariate Data. The R Journal 12, 243. URL: https://journal.r-project.org/archive/2020/RJ-2020-027/index.html, doi:10.32614/RJ-2020-027.
  11. Wickham, H., Cook, D., Hofmann, H., Buja, A., 2011. tourr: An R Package for Exploring Multivariate Data with Projections. Journal of Statistical Software 40. URL: http://www.jstatsoft.org/v40/i02/, doi:10.18637/jss.v040.i02.