scatteR: Generating instance space based on scagnostics
Janith Wanniarachchi
BSc. Statistics (Hons.)
University of Sri Jayewardenepura
Sri Lanka
Session 36 (Synthetic Data and Text Analysis)
at useR! conference 2022
on the 23rd of June
The late Leland Wilkinson developed graph-theory-based scagnostics, which quantify the features of a scatterplot into measurements ranging from
0 to 1
As part of my research I needed a way to score different features of a bivariate dataset.
That is where scagnostics came into play.
As the next step I wanted to generate bivariate datasets which would have scagnostic values that I would expect.
Basically an inverse scagnostics!
Surprisingly there aren't any.
That's where I found my sweet
research gap
Earlier, given N (X, Y) coordinate pairs,
we got a 9×1 vector of scagnostic values.
Now, when we give a 9×1 vector of scagnostic values,
we need to get N (X, Y) data points back!
Sounds pretty simple. We can invert a function like this:
if $f(x) = 4x + 3$
then
$f^{-1}(x) = \dfrac{x - 3}{4}$
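As a quick sanity check, this inversion can be verified in a couple of lines of R (the function names here are just for illustration):

```r
# Toy example: a function and its inverse.
f <- function(x) 4 * x + 3
f_inv <- function(x) (x - 3) / 4

f_inv(f(10))  # recovers 10
```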
$c_{\text{outlying}} = \dfrac{\text{length}(T_{\text{outliers}})}{\text{length}(T)}$
Here length(Toutliers) is the total length of edges adjacent to outlying points and length(T) is the total length of edges of the final minimum spanning tree
$c_{\text{convex}} = w \times \dfrac{\text{area}(A)}{\text{area}(H)}$
The convexity measure is based on the ratio of the area of the alpha hull $A$ to the area of the convex hull $H$.
$c_{\text{monotonic}} = r^2_{\text{Spearman}}$
This is the only measurement not based on a geometric graph.
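Since the monotonic measure is just the squared Spearman rank correlation, it can be computed directly in base R; the data below is a made-up monotone pattern:

```r
set.seed(1)
x <- runif(100)
y <- 2 * x + rnorm(100, sd = 0.05)  # noisy but monotone relationship

# c_monotonic = squared Spearman rank correlation
c_monotonic <- cor(x, y, method = "spearman")^2
c_monotonic  # close to 1 for a strongly monotone pattern
```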
$c_{\text{skinny}} = 1 - \dfrac{\sqrt{4\pi\,\text{area}(A)}}{\text{perimeter}(A)}$
The ratio of perimeter to area of a polygon measures, roughly, how skinny it is.
Unlike $f(x) = 4x + 3$,
where every $f(x)$ value
has a unique, distinct $x$ value,
here there might be multiple datasets that satisfy
all nine of the given specified scagnostic values.
The idea actually came to me while having dessert.
Why don't I first sprinkle a few data points on a 2D plot,
making sure that they land in the right places,
and then keep adding more sprinkles (data points) on top,
so that the final set of sprinkles looks good!
Given a set of N (X, Y) data points and a 9×1 vector of expected scagnostic values, we need to minimize the distance between the scagnostic vector of the current dataset and the expected scagnostic measurements.

Let's define the loss function as

$$L([\underline{X}\ \underline{Y}]) = \frac{1}{k}\sum_{i=1}^{k}\left|s_i\left(\left[D_{t-1};[\underline{X}\ \underline{Y}]\right]\right) - m_i^{0}\right|$$

where $D_t = [D_{t-1}, \underline{X}, \underline{Y}]$, $i \in \{\text{Outlying}, \text{Skewed}, \ldots, \text{Monotonic}\}$, and $s_i([D_{t-1};[\underline{X}\ \underline{Y}]])$ and $m_i^{0}$ are the $i$th calculated and expected scagnostic measurements respectively.
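The loss can be sketched in a few lines of R. `toy_scagnostics` here is a stand-in I made up that computes only the Monotonic measure; scatteR itself computes the full set of scagnostics:

```r
# L = (1/k) * sum_i |s_i(D_t) - m_i^0|, with D_t = [D_{t-1}; new points]
loss <- function(new_xy, data, target, scag_fun) {
  augmented <- rbind(data, new_xy)       # append the candidate points
  s <- scag_fun(augmented)               # calculated measurements s_i
  mean(abs(s[names(target)] - target))   # mean absolute deviation
}

# Stand-in scagnostics: only the Monotonic measure (squared Spearman).
toy_scagnostics <- function(d) {
  c(Monotonic = cor(d[, 1], d[, 2], method = "spearman")^2)
}

set.seed(1)
d0 <- cbind(x = 1:20, y = 1:20 + rnorm(20, sd = 0.1))
loss(c(21, 21), d0, target = c(Monotonic = 0.9), toy_scagnostics)
```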
The name of the algorithm comes from annealing in material sciences, a technique involving heating and controlled cooling of a material to alter its physical properties.
The algorithm works by setting an initial temperature value and decreasing the temperature gradually towards zero.
As the temperature is decreased the algorithm becomes greedier in selecting the optimal solution.
At each time step, the algorithm proposes a new solution close to the current one and accepts it based on the quality of the solution and a temperature-dependent acceptance probability.
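Here is a minimal sketch of the idea on a toy 1-D objective (this is not scatteR's actual implementation, which relies on the GenSA package):

```r
set.seed(42)
objective <- function(x) (x - 3)^2   # toy loss, minimum at x = 3

x_cur <- 0
f_cur <- objective(x_cur)
temp  <- 1

for (step in 1:2000) {
  x_new <- x_cur + rnorm(1, sd = 0.5)  # propose a nearby solution
  f_new <- objective(x_new)
  # Always accept improvements; accept worse solutions with probability
  # exp(-(f_new - f_cur) / temp), which shrinks as the temperature cools.
  if (f_new < f_cur || runif(1) < exp(-(f_new - f_cur) / temp)) {
    x_cur <- x_new
    f_cur <- f_new
  }
  temp <- temp * 0.995                 # gradual cooling: greedier over time
}

x_cur  # ends up near the minimum at 3
```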
library(scatteR)
library(tidyverse)

df <- scatteR(
  measurements = c("Monotonic" = 0.9),
  n_points = 200,
  error_var = 9
)
qplot(data = df, x = x, y = y)
The simulated annealing component is achieved through the GenSA package.
Y. Xiang, et al (2013). Generalized Simulated Annealing for Efficient Global Optimization: the GenSA Package for R.
Here I will be showcasing the documentation and the arguments that are available and what each of those arguments mean.
Here I will be talking about the effects of changing different parameters.
Here I will be talking about the time complexity of scatteR and ways to speed up the generation process.
Generative art based on the bivariate numeric relationships in the palmerpenguins dataset
Better optimization methods
Parallelized implementations
Rewriting the relevant R code in C++
and many more so stay tuned!
For awarding me the diversity scholarship that gave me the financial support to speak before you
Email: janithcwanni@gmail.com
Twitter: @janithcwanni
Github: @janithwanni
Linkedin: Janith Wanniarachchi
Try scatteR at https://github.com/janithwanni/scatteR
Slides available at: https://scatter-use-r-2022.netlify.app/
Created with xaringan and xaringan themer
The following content was created by the respective creators and is not my work.