scatteR: Generating instance space based on scagnostics

Janith Wanniarachchi
BSc. Statistics (Hons.)
University of Sri Jayewardenepura
Sri Lanka

Session 36 (Synthetic Data and Text Analysis)
at useR! conference 2022
on the 23rd of June


What exactly are these scagnostics?

The late Leland Wilkinson developed graph-theoretic scagnostics, which quantify the features of a scatterplot as measurements ranging from 0 to 1.


An example scatterplot

The scagnostics


How do scagnostics work?


So how did I end up here?

As part of my research I needed a way to give marks for different features of a bivariate dataset.

That is where Scagnostics came into play.

As the next step, I wanted to generate bivariate datasets that would have the scagnostic values I expected.

Basically an inverse scagnostics!


Are there any existing solutions to this?

Surprisingly there aren't any.

That's where I found my sweet research gap.

How do we generate data from this?

Earlier, given $N$ $(X, Y)$ coordinate pairs,

we got a $9 \times 1$ vector of scagnostic values.

Now, when we give a $9 \times 1$ vector of scagnostic values,

we need to get $N$ $(X, Y)$ data points!

So we have to reverse a function, right?

Sounds pretty simple. We can reverse a function like this:

if $f(x) = 4x + 3$

then

$f^{-1}(x) = \frac{x - 3}{4}$
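As a quick sanity check of this idea, here is a minimal R sketch. The names `f` and `f_inv` are only illustrative and are not part of scatteR. It applies the function, then its inverse, and confirms that the original inputs come back.

```r
# A one-to-one function and its algebraic inverse
f <- function(x) 4 * x + 3
f_inv <- function(y) (y - 3) / 4

x <- c(-2, 0, 1.5, 10)

# Applying f and then f_inv should recover the original inputs
recovered <- f_inv(f(x))
all.equal(recovered, x)
```

Because every output of `f` comes from exactly one input, the round trip loses nothing.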


But how do we actually calculate scagnostics?

Outlying

$c_{outlying} = \frac{length (T_{outliers})}{length (T)}$

Here $length (T_{outliers})$ is the total length of edges adjacent to outlying points and $length (T)$ is the total length of edges of the final minimum spanning tree

Convex

$c_{convex} = w \times \frac{area (A)}{area (H)}$

The convexity measure is based on the ratio of the area of the alpha hull and the area of the convex hull.

Monotonic

$c_{monotonic} = r_{Spearman}^{2}$

This is the only measure not based on geometric graphs.
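Since the Monotonic measure only needs a rank correlation, it can be sketched in base R. The helper name `monotonic` below is my own for illustration, and this is not necessarily how any scagnostics package computes it internally.

```r
# Monotonic measure: squared Spearman rank correlation of x and y
monotonic <- function(x, y) {
  cor(x, y, method = "spearman")^2
}

x <- 1:50

# Strictly increasing curve: the ranks of x and y agree perfectly
increasing <- monotonic(x, exp(x / 10))

# Symmetric U shape: no overall monotone trend
u_shape <- monotonic(x, (x - 25.5)^2)

c(increasing = increasing, u_shape = u_shape)
```

The curved but strictly increasing relationship scores as high as a straight line would, because only the ordering of the points matters.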

Skinny

$c_{skinny} = 1 - \frac{\sqrt{4 \pi \, area (A)}}{perimeter (A)}$

The ratio of perimeter to area of a polygon measures, roughly, how skinny it is.
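To get a feel for the Skinny formula, here is a rough base R sketch that applies it to ordinary polygons. In scagnostics the polygon is the alpha hull of the data; here I simply assume the polygon vertices are already given in order. The helper names are hypothetical.

```r
# Area of a simple polygon via the shoelace formula (vertices given in order)
polygon_area <- function(x, y) {
  nxt <- c(2:length(x), 1)
  abs(sum(x * y[nxt] - x[nxt] * y)) / 2
}

# Perimeter: sum of the edge lengths between consecutive vertices
polygon_perimeter <- function(x, y) {
  nxt <- c(2:length(x), 1)
  sum(sqrt((x[nxt] - x)^2 + (y[nxt] - y)^2))
}

# Skinny measure from the slide: 1 - sqrt(4 * pi * area) / perimeter
skinny <- function(x, y) {
  1 - sqrt(4 * pi * polygon_area(x, y)) / polygon_perimeter(x, y)
}

# A 360-sided approximation of a circle versus a 100 x 1 rectangle
theta <- seq(0, 2 * pi, length.out = 361)[-361]
circle_like <- skinny(cos(theta), sin(theta))
thin_rect <- skinny(c(0, 100, 100, 0), c(0, 0, 1, 1))
```

The near-circle gives a value close to 0 and the thin rectangle a value much closer to 1, which matches the intuition that the measure grows as the shape gets thinner.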

But these equations aren't one-to-one functions!

Unlike in $f(x) = 4x + 3$,

where every $f(x)$ value corresponds to exactly one $x$ value,

here multiple datasets might satisfy

all nine specified scagnostic values.
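A small illustration of this, using only the Monotonic measure (squared Spearman correlation) as an example: the two datasets below are made up for this sketch, yet they produce the same value.

```r
x <- 1:30
dataset_a <- data.frame(x = x, y = 2 * x + 1)   # straight line
dataset_b <- data.frame(x = x, y = exp(x / 5))  # exponential curve

# Monotonic measure for a data frame with columns x and y
monotonic_of <- function(d) cor(d$x, d$y, method = "spearman")^2

value_a <- monotonic_of(dataset_a)
value_b <- monotonic_of(dataset_b)

# Same scagnostic value, clearly different data
c(a = value_a, b = value_b)
```

So knowing the scagnostic value alone cannot tell us which dataset produced it, and there is no unique inverse to compute.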

This is getting out of hand!

Inspiration can come at the hungriest moments

The idea actually came to me while having dessert.

Why don't I first sprinkle a few data points on a 2D plot,

making sure that they land in the right places,

and then keep adding more sprinkles (data points) on top of those,

so that the final set of sprinkles (data points) looks good!


But how do we arrange these points in the most optimal manner?

Given a set of $N$ $(X, Y)$ data points and a $9 \times 1$ vector of expected scagnostic values, we need to minimize the distance between the scagnostic vector of the current dataset and the expected scagnostic vector.

Let's define the loss function as,

$L([\underline{X} \; \underline{Y}]) = \frac{1}{k} \sum_{i = 1}^{k} \left| s_{i}\left(\begin{bmatrix} D_{t - 1} \\ [\underline{X} \; \underline{Y}] \end{bmatrix}\right) - m_{0 i} \right|$

where $D_{t} = \begin{bmatrix} D_{t - 1} \\ [\underline{X} \; \underline{Y}] \end{bmatrix}$ is the dataset after the new points are appended, $i \in \{Outlying, Skewed, \dots, Monotonic\}$ indexes the $k$ scagnostic measures, and $s_{i}(\cdot)$ and $m_{0 i}$ are the $i^{th}$ calculated and expected scagnostic measurements respectively.

So now we need to find an optimizer for the $2N$ parameters $x_{1}, x_{2}, \dots, x_{N}, y_{1}, y_{2}, \dots, y_{N}$.
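The loss can be sketched in a few lines of R. This is only an illustration and not the scatteR source: `toy_scagnostics` is a hypothetical stand-in that computes just the Monotonic measure so the example runs with base R alone, and `scag_loss` is my own name for the mean absolute difference defined above.

```r
# Hypothetical stand-in for a scagnostic calculator (Monotonic only)
toy_scagnostics <- function(xy) {
  c(Monotonic = cor(xy[, 1], xy[, 2], method = "spearman")^2)
}

# Mean absolute difference between calculated and expected measurements.
# params holds the 2N values to optimize: x_1..x_N followed by y_1..y_N.
scag_loss <- function(params, existing, expected) {
  new_points <- matrix(params, ncol = 2)    # column 1 = x, column 2 = y
  combined <- rbind(existing, new_points)   # earlier points plus the new ones
  calculated <- toy_scagnostics(combined)[names(expected)]
  mean(abs(calculated - expected))
}

# Ten perfectly monotonic points, scored against two different targets
new_params <- c(1:10, 1:10)
loss_exact <- scag_loss(new_params, existing = NULL, expected = c(Monotonic = 1))
loss_off   <- scag_loss(new_params, existing = NULL, expected = c(Monotonic = 0.9))
```

An optimizer then only has to search over `params` to push this loss towards zero.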


Simulated Annealing

The name of the algorithm comes from annealing in materials science, a technique involving heating and controlled cooling of a material to alter its physical properties.

The algorithm works by setting an initial temperature value and gradually decreasing the temperature towards zero.

As the temperature decreases, the algorithm becomes greedier in selecting the optimal solution.

In each time step, the algorithm picks a candidate solution close to the current solution and accepts or rejects it based on the quality of the candidate and a temperature-dependent acceptance probability.
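The steps above can be sketched as a small base R function. This is the classic textbook version with a geometric cooling schedule and a Metropolis-style acceptance rule, which I am assuming purely for illustration. It is not the generalized variant, and every name and default below is hypothetical.

```r
simulated_annealing <- function(loss, init, n_iter = 2000,
                                temp0 = 1, cooling = 0.995, step_sd = 0.5) {
  current <- init
  current_loss <- loss(current)
  best <- current
  best_loss <- current_loss
  temp <- temp0
  for (i in seq_len(n_iter)) {
    # Propose a candidate close to the current solution
    candidate <- current + rnorm(length(current), sd = step_sd)
    candidate_loss <- loss(candidate)
    delta <- candidate_loss - current_loss
    # Always accept improvements; accept worse candidates with a
    # probability that shrinks as the temperature drops
    if (delta <= 0 || runif(1) < exp(-delta / temp)) {
      current <- candidate
      current_loss <- candidate_loss
    }
    # Remember the best solution seen so far
    if (current_loss < best_loss) {
      best <- current
      best_loss <- current_loss
    }
    temp <- temp * cooling  # gradual cooling towards zero
  }
  list(par = best, value = best_loss)
}

# Toy problem: minimize (x - 2)^2 starting far from the minimum
set.seed(1)
res <- simulated_annealing(function(x) (x - 2)^2, init = -5)
```

Early on the high temperature lets the search wander, while late in the run it behaves almost like a greedy local search around the best region found.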


The algorithm


Introducing

Install and try it out for yourself from https://github.com/janithwanni/scatteR


A simple example

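library(scatteR)
library(tidyverse)

# Generate 200 points aiming for a Monotonic value of 0.9
df <- scatteR(measurements = c("Monotonic" = 0.9), n_points = 200, error_var = 9)

# Plot the generated points (qplot() is deprecated in recent ggplot2 versions)
ggplot(df, aes(x = x, y = y)) + geom_point()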


Behind the scenes,

The simulated annealing component is implemented with the GenSA package.

Y. Xiang et al. (2013). Generalized Simulated Annealing for Efficient Global Optimization: the GenSA Package for R.


The scatteR() function


Here I will be showcasing the documentation, the arguments that are available, and what each of those arguments means.

Performance results

Type-wise error


Is there a special recipe for the hyperparameters?


Here I will be talking about the effects of changing different parameters.

How much time does it take?


Here I will be talking about the time complexity of scatteR and the ways to speed up the generation process.

In summary, what can scatteR do for you?

As a teacher,
Generate dummy data for the students to try out new statistical methods
Introduce students to the concept of scagnostics

As a data scientist,
Synthesize small scale numerical datasets for test purposes
Generate dummy data to try out new data science methods

As an everyday R user,
Generate data for an interesting piece of generative art made with R
Generate a quick sample of data to test out a new package that you installed

A generative art piece based on the bivariate numeric relationships in the palmerpenguins dataset


Where to from here?

Better optimization methods
Parallelized implementations
Replacing the relevant R code with C++ code
And many more, so stay tuned!


Thank you for listening!

Thank you to my supervisor

Dr. Thiyanga Talagala

Check out her work on Github @thiyangt

Thank you to useR! 2022 and sponsors

For awarding me with the diversity scholarship that gave me the financial strength to speak before you


Have any follow up questions?

Email: janithcwanni@gmail.com

Twitter: @janithcwanni

Github: @janithwanni

Linkedin: Janith Wanniarachchi

Try scatteR at https://github.com/janithwanni/scatteR

Slides available at: https://scatter-use-r-2022.netlify.app/
Created with xaringan and xaringanthemer


Acknowledgements

The following content was created by the respective creators and is not my own work.

Image of Ice cream sprinkles (Slide #17): Photo by David Calavera on Unsplash
Image of Annealing (Slide #19): http://www.turingfinance.com/simulated-annealing-for-portfolio-optimization/
Image of Caterpie staring at moon made by All0412 on deviantart (Slide #24) https://www.deviantart.com/all0412/art/Caterpie-353761155
