scatteR: Generating instance space based on scagnostics


Janith Wanniarachchi
BSc. Statistics (Hons.)
University of Sri Jayewardenepura
Sri Lanka




Session 36 (Synthetic Data and Text Analysis)
at useR! conference 2022
on the 23rd of June

1 / 28

What exactly are these scagnostics?

The late Leland Wilkinson developed graph-theory-based scagnostics (scatterplot diagnostics) that quantify the features of a scatterplot as measurements ranging from 0 to 1.

2 / 28

An example scatterplot

The scagnostics

3 / 28

How does scagnostics work?

4 / 28

So how did I end up here?

As part of my research, I needed a way to score different features of a bivariate dataset.

That is where scagnostics came into play.

As the next step, I wanted to generate bivariate datasets that have the scagnostic values I expect.

Basically an inverse scagnostics!

5 / 28

Are there any existing solutions to this?

Surprisingly there aren't any.

That's where I found my sweet research gap.

6 / 28

How do we generate data from this?

Earlier, given N (X, Y) coordinate pairs,

we got a 9×1 vector of scagnostic values.

Now, given a 9×1 vector of scagnostic values,

we need to get N (X, Y) data points!

7 / 28

So we have to reverse a function right?

Sounds pretty simple. We can reverse a function like this

if f(x)=4x+3

then

$f^{-1}(x) = \frac{x - 3}{4}$
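
A quick numerical check of this inverse, as a minimal sketch in base R. The names `f` and `f_inv` are chosen here for illustration only.

```r
# Forward function from the slide and its algebraic inverse
f <- function(x) 4 * x + 3
f_inv <- function(y) (y - 3) / 4

x <- c(-2, 0, 1.5, 10)

# Applying the inverse after f should recover the original inputs
recovered <- f_inv(f(x))
print(recovered)
```

Because every output of `f` comes from exactly one input, the round trip returns the values we started with.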

8 / 28

But how do we actually calculate scagnostics?

Outlying

$c_{outlying} = \frac{length(T_{outliers})}{length(T)}$

Here $length(T_{outliers})$ is the total length of edges adjacent to outlying points and $length(T)$ is the total length of edges of the final minimum spanning tree.

Convex

$c_{convex} = w \times \frac{area(A)}{area(H)}$

The convexity measure is based on the ratio of the area of the alpha hull and the area of the convex hull.

Monotonic

$c_{monotonic} = r_{Spearman}^{2}$

This is the only measurement not based on the geometrical graphs.
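
Since this measure needs no geometric graph, it can be sketched directly in base R. The helper name `monotonic` is an assumption made for this illustration and is not part of any package.

```r
# Monotonic measure as defined above: squared Spearman rank correlation
monotonic <- function(x, y) cor(x, y, method = "spearman")^2

set.seed(1)
x <- runif(100)

strong <- monotonic(x, exp(3 * x))  # strictly increasing but nonlinear relationship
weak   <- monotonic(x, runif(100))  # unrelated noise

print(c(strong = strong, weak = weak))
```

A strictly increasing relationship scores 1 even when it is far from linear, because only the ranks matter.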

Skinny

$c_{skinny} = 1 - \frac{\sqrt{4\pi \, area(A)}}{perimeter(A)}$

The ratio of perimeter to area of a polygon measures, roughly, how skinny it is.
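
To get a feel for the Skinny formula, here is a minimal base R sketch that applies it to plain polygons given as ordered vertices. This is an assumption made to keep the example short: the real measure is computed on the alpha hull, which is not built here, and the helper names are hypothetical.

```r
# Area of a simple polygon from its ordered vertices (shoelace formula)
polygon_area <- function(px, py) {
  nx <- c(px[-1], px[1])
  ny <- c(py[-1], py[1])
  abs(sum(px * ny - nx * py)) / 2
}

# Perimeter: sum of the lengths of consecutive edges, closing the polygon
polygon_perimeter <- function(px, py) {
  nx <- c(px[-1], px[1])
  ny <- c(py[-1], py[1])
  sum(sqrt((nx - px)^2 + (ny - py)^2))
}

# Skinny formula from the slide, applied to a plain polygon
skinny <- function(px, py) {
  1 - sqrt(4 * pi * polygon_area(px, py)) / polygon_perimeter(px, py)
}

theta  <- seq(0, 2 * pi, length.out = 361)[-361]
circle <- skinny(cos(theta), sin(theta))              # close to 0
sliver <- skinny(c(0, 10, 10, 0), c(0, 0, 0.1, 0.1))  # long thin rectangle

print(c(circle = circle, sliver = sliver))
```

A near-circular shape scores close to 0, while a long thin rectangle scores close to 1.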

9 / 28

But these equations aren't one to one functions!

Unlike in f(x)=4x+3

where for every f(x) value

we have a unique distinct x value,

here multiple datasets might satisfy

all nine of the specified scagnostic values.

This is getting out of hand!
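
A small base R illustration of the problem, using only the Monotonic measure (squared Spearman correlation) as a stand-in for the full set of nine. The helper name `monotonic` is hypothetical.

```r
# Two clearly different datasets that share the same Monotonic value
monotonic <- function(x, y) cor(x, y, method = "spearman")^2

x <- 1:20
line_y  <- 2 * x + 1    # straight line
curve_y <- exp(x / 4)   # exponential curve

m_line  <- monotonic(x, line_y)
m_curve <- monotonic(x, curve_y)

print(c(line = m_line, curve = m_curve))
```

Both datasets map to the same measurement, so the measurement alone cannot tell us which dataset produced it.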

10 / 28

11 / 28

Inspiration can come at the hungriest moments

The idea actually came to me while having dessert

Why don't I first sprinkle a few data points on a 2D plot,

making sure that they land in the right places,

and then, on top of those sprinkles (data points),

keep on adding more sprinkles

so that the final set of sprinkles (data points) looks good!

12 / 28

But how do we arrange these points in the most optimal manner?

Given a set of N number of (X,Y) data points and a 9×1 vector of expected scagnostic values, we need to minimize the distance between the scagnostic vector of the current dataset and the expected scagnostic measurement.

Let's define the loss function as,

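$$L([\underline{X}\ \underline{Y}]) = \frac{1}{k} \sum_{i=1}^{k} \left| s_i([D_{t-1}; [\underline{X}\ \underline{Y}]]) - m_{0i} \right|$$

where $D_t = [D_{t-1}, \underline{X}, \underline{Y}]$, $i \in \{Outlying, Skewed, \ldots, Monotonic\}$, and $s_i([D_{t-1}; [\underline{X}\ \underline{Y}]])$ and $m_{0i}$ are the $i$th calculated and expected scagnostic measurements respectively.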

So now we need to find an optimizer for the 2N parameters $x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N$.
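
The loss above is a mean absolute difference, which can be sketched in a few lines of base R. The function name `scag_loss` and the example values are assumptions made for illustration. They are not taken from the scatteR source.

```r
# Mean absolute difference between calculated and expected measurements
scag_loss <- function(calculated, expected) {
  # Both vectors must list the same measurements in the same order
  stopifnot(identical(names(calculated), names(expected)))
  mean(abs(calculated - expected))
}

# Made-up values for three of the measurements
calculated <- c(Outlying = 0.10, Skinny = 0.55, Monotonic = 0.80)
expected   <- c(Outlying = 0.00, Skinny = 0.50, Monotonic = 0.90)

loss <- scag_loss(calculated, expected)
print(loss)
```

A loss of 0 means every calculated measurement matches its expected value exactly.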

13 / 28

Simulated Annealing

The name of the algorithm comes from annealing in materials science, a technique involving heating and controlled cooling of a material to alter its physical properties.

The algorithm works by setting an initial temperature value and decreasing the temperature gradually towards zero.

As the temperature is decreased the algorithm becomes greedier in selecting the optimal solution.

In each time step, the algorithm selects a solution close to the current solution and accepts or rejects it based on the quality of the new solution and a temperature-dependent acceptance probability.
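
The steps above can be sketched on a toy one-dimensional problem in base R. This is a minimal illustration under assumed settings (geometric cooling, Gaussian proposals, a quadratic objective); it is not the variant of simulated annealing that scatteR uses.

```r
set.seed(42)

objective <- function(x) (x - 2)^2  # toy loss with its minimum at x = 2

current <- 10        # starting solution
best <- current      # best solution seen so far
temperature <- 1     # initial temperature

for (step in 1:2000) {
  candidate <- current + rnorm(1, sd = 0.5)  # a solution close to the current one
  delta <- objective(candidate) - objective(current)

  # Always accept improvements; accept worse moves with probability exp(-delta / T)
  if (delta <= 0 || runif(1) < exp(-delta / temperature)) {
    current <- candidate
  }
  if (objective(current) < objective(best)) {
    best <- current
  }

  temperature <- temperature * 0.995  # cool gradually towards zero
}

print(best)
```

Early on, the high temperature lets the search accept some worse moves. As the temperature falls, the acceptance probability for worse moves shrinks and the search becomes greedy.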

14 / 28

The algorithm

15 / 28

Introducing

Install and try it out for yourself from https://github.com/janithwanni/scatteR

16 / 28

A simple example

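library(scatteR)
library(tidyverse)

# Request 200 points with a Monotonic value of 0.9
df <- scatteR(measurements = c("Monotonic" = 0.9), n_points = 200, error_var = 9)
# Scatterplot of the generated points
ggplot(df, aes(x = x, y = y)) + geom_point()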
17 / 28

Behind the scenes,

The simulated annealing component is implemented with the GenSA package.

Y. Xiang et al. (2013). Generalized Simulated Annealing for Efficient Global Optimization: the GenSA Package for R. The R Journal, 5(1).
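
To show the overall idea without installing anything, here is a rough stand-in that uses base R's `optim(method = "SANN")` instead of GenSA and targets only the Monotonic measure. Everything in it (the loss, the settings, the variable names) is an assumption for illustration and does not reproduce scatteR's internals.

```r
set.seed(2022)

n_points <- 30
target <- 0.9  # expected Monotonic value

# The 2N parameters: the first N are x coordinates, the last N are y coordinates
loss <- function(par) {
  x <- par[seq_len(n_points)]
  y <- par[n_points + seq_len(n_points)]
  abs(cor(x, y, method = "spearman")^2 - target)
}

start <- runif(2 * n_points)  # random initial scatter

# Base R simulated annealing; optim() returns the best parameters it visited
fit <- optim(start, loss, method = "SANN",
             control = list(maxit = 5000, temp = 0.01))

df <- data.frame(x = fit$par[seq_len(n_points)],
                 y = fit$par[n_points + seq_len(n_points)])

print(c(initial = loss(start), final = fit$value))
```

The optimizer moves all 2N coordinates at once and keeps the best arrangement it finds, which is the same role GenSA plays behind `scatteR()`.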

18 / 28

The scatteR() function

19 / 28

Here I will be showcasing the documentation and the arguments that are available and what each of those arguments mean.

Performance results

Type wise error

20 / 28

Is there a special recipe for the hyperparameters?

21 / 28

Here I will be talking about the effects of changing different parameters.

How much time does it take?

22 / 28

Here I will be talking about the time complexity of scatteR and the ways to speed up the generation process.

In summary, what can scatteR do for you?

As a teacher,

  • Generate dummy data for the students to try out new statistical methods
  • Introduce students to the concept of scagnostics

As a data scientist,

  • Synthesize small scale numerical datasets for test purposes
  • Generate dummy data to try out new data science methods

As an everyday R user,

  • Generate data for an interesting piece of generative art made with R
  • Generate a quick sample of data to test out a new package that you installed
23 / 28

A piece of generative art based on the bivariate numeric relationships of the palmerpenguins dataset

24 / 28

Where to from here?

  • Better optimization methods

  • Parallelized implementations

  • Replacing the relevant R code with C++ code

  • and many more so stay tuned!

25 / 28


Thank you for listening!

Thank you to my supervisor

Dr. Thiyanga Talagala

Check out her work on Github @thiyangt


Thank you to useR! 2022 and sponsors

For awarding me the diversity scholarship that gave me the financial support to speak before you

26 / 28

Have any follow up questions?

Email: janithcwanni@gmail.com

Twitter: @janithcwanni

Github: @janithwanni

Linkedin: Janith Wanniarachchi


Try scatteR at https://github.com/janithwanni/scatteR

Slides available at: https://scatter-use-r-2022.netlify.app/
Created with xaringan and xaringan themer

27 / 28

Acknowledgements

The following content was created by its respective creators and is not my own work.

28 / 28
