class: center, middle, first-slide, inverse <span style="font-family: 'Lato', sans-serif; font-weight: 900; font-size: 36pt;color:#FFF"> scatteR: Generating instance space based on scagnostics </span> <hr style="width:60%;margin: 0 auto;color:#FFF"/> <span style="font-family: 'Lato', sans-serif;font-size: 24pt;font-weight:400;color:#FFF"> Janith Wanniarachchi <br/> <span style="font-weight:400;font-size: 18pt;">BSc. Statistics (Hons.)</span><br> <span style="font-weight:300;font-size: 17pt;">University of Sri Jayewardenepura</span><br> <span style="font-weight:300;font-size: 15pt;">Sri Lanka</span> </span> <br><br><br> <span style="font-family: 'Lato', sans-serif;font-size: 14pt;font-weight:600;color:#FFF"> Session 36 (Synthetic Data and Text Analysis) <br> at useR! conference 2022 <br> on the 23rd of June </span> --- class: center, middle ### What exactly are these scagnostics? .pull-left[ <img src="imgs/leland_prof_pic.png" style="width:110%" /> The late Leland Wilkinson developed graph theory based scagnostics that quantifies the features of a scatterplot into measurements ranging from <br> 0 to 1 ] <!-- .pull-left[From this:<img src="imgs/scag_splom.png">] --> <!-- .pull-right[To this:<img src="imgs/scag_scag_splom.png">] --> .pull-right[<img src="imgs/scag_measures_concat.jpg" style="width:90%">] --- class: middle .pull-left[ ### An example scatterplot <!-- --> ] .pull-right[ ### The scagnostics <!-- --> ] --- ## How does scagnostics work? .pull-left[ <img src="imgs/scagnostics flowchart.png"> ] .pull-right[ <img src="imgs/mst_example.png" style="width:50%"><img src="imgs/alpha_hull_example.png" style="width:50%"> .center[<img src="imgs/convex_hull_example.png" style="width:50%">] ] --- # So how did I end up here? .pull-left[ As part of my research I needed a way to give marks for different features of a bivariate dataset. That is where Scagnostics came into play. .center[ <img src="https://c.tenor.com/-1b16JKvULIAAAAC/tletter.gif" /> ] ] .pull-right[ As the next step I wanted to generate bivariate datasets which would have scagnostic values that I would expect. Basically an inverse scagnostics! <br><br> .center[ <img src="imgs/genius meme.jpg" /> ] ] --- class: center # Are there any existing solutions to this? -- Surprisingly there aren't any. .center[<img src="imgs/limited by tech meme.jpg" style="width:60%"/>] -- That's where I found my sweet _<span style="font-size:24pt;color: var(--secondary);font-weight:600">research gap<span>_ --- class: center, middle ## How do we generate data from this? -- Earlier given `\(N\)` number of `\((X,Y)\)` coordinate pairs we got a `\(9 \times 1\)` vector of scagnostic values -- Now when we give a `\(9\times 1\)` vector of scagnostic values we need to get `\(N\)` number of `\((X,Y)\)` data points! --- class: center, middle # So we have to reverse a function right? Sounds pretty simple. We can reverse a function like this if $$ f(x) = 4x + 3 $$ -- then $$f^{-1}(x) = \frac{x - 3}{4} $$ --- ### But how do we actually calculate scagnostics? .pull-left[ <h3 style="color:var(--secondary)";> Outlying </h3> `$$c_{\text{outlying}} = \frac{\text{length}(T_{\text{outliers}})}{\text{length}(T)}$$` Here `\(\text{length}(T_{\text{outliers}})\)` is the total length of edges adjacent to outlying points and `\(\text{length}(T)\)` is the total length of edges of the final minimum spanning tree <h3 style="color:var(--secondary)";> Convex </h3> `$$c_{\text{convex}} = w \times \frac{\text{area}(A)}{\text{area}(H)}$$` The convexity measure is based on the ratio of the area of the alpha hull and the area of the convex hull. ] .pull-left[ <h3 style="color:var(--secondary)";> Monotonic </h3> `$$c_{\text{monotonic}} = r^2_{\text{Spearman}}$$` This is the only measurement not based on the geometrical graphs. <h3 style="color:var(--secondary)";> Skinny </h3> `$$c_{\text{skinny}} = 1- \frac{\sqrt{4\pi\text{area}(A)}}{\text{perimeter}(A)}$$` The ratio of perimeter to area of a polygon measures, roughly, how skinny it is. ] --- class: center, middle # But these equations aren't one to one functions! Unlike in `\(f(x) = 4x + 3\)` where for every `\(f(x)\)` value we have a unique distinct `\(x\)` value, -- Here there might be multiple datasets that might satisfy all nine of a given specified scagnostic values. ### This is getting out of hand! <img src="https://i.imgur.com/l8W8wnf.gif" /> --- class: center, middle, first-slide <img src="imgs/ice_cream_sprinkles.jpg" style="position:absolute;top:0px;left:0px;object-fit:fill;scale: 140%;object-position: 0px 90px;"> --- ## Inspiration can come at the hungriest moments The idea actually came to me while having desert .center[<img src="https://c.tenor.com/KnbkykkOK04AAAAC/ice-cream-sprinkle.gif" style="width:40%">] -- > Why don't I first sprinkle a little bit of data points on a 2D plot, > making sure that they land in the right places > and add on top of those sprinkles (data points) > keep on adding more sprinkles > so that the final set of sprinkles (data points) looks good! --- ## But how do we arrange these points in the most optimal manner? Given a set of `\(N\)` number of `\((X,Y)\)` data points and a `\(9\times 1\)` vector of expected scagnostic values, we need to minimize the distance between the scagnostic vector of the current dataset and the expected scagnostic measurement. -- Let's define the loss function as, $$ \let\sb_ $$ $$ L(\left[\underline{X}\text{ }\underline{Y}\right]) = \frac{1}{k}\sum_{i=1}^k |s\sb{i}\Big( `\begin{bmatrix} D\sb{i-1} \\ [\underline{X}\text{ }\underline{Y}] \end{bmatrix}` \Big)-m\sb{0i}| $$ where `\(\mathbf{D_t} = [D_{t-1},\underline{X},\underline{Y}]\)`, `\(i \in \{Outlying, Skewed, \dots, Monotonic\}\)` , `\(s_i(\big[D_{i-1};[\underline{X}\text{ }\underline{Y}]\big])\)` and `\(m_{0i}\)` is the `\(i^{th}\)` calculated and expected scagnostic measurement respectively. -- ### so now we need to find a optimizer for the `\(2N\)` parameters of `\(x_1,x_2,...,x_N,y_1,y_2,...,y_N\)` --- ## Simulated Annealing .pull-left[ <img src="http://www.turingfinance.com/wp-content/uploads/2015/05/Annealing-700x536.jpg" /> The name of the algorithm comes from annealing in material sciences, a technique involving heating and controlled cooling of a material to alter its physical properties. ] .pull-right[ The algorithm works by setting an initial temperature value and decreasing the temperature gradually towards zero. As the temperature is decreased the algorithm becomes greedier in selecting the optimal solution. In each time step, the algorithm selects a solution closer to the current solution and would accept the new solution based on the quality of the solution and the temperature dependent acceptance probabilities. <img src="https://upload.wikimedia.org/wikipedia/commons/d/d5/Hill_Climbing_with_Simulated_Annealing.gif" /> ] --- # The algorithm .center[<img src="imgs/scatteR_flowchart.png" style="width:60%">] --- class: center, middle # Introducing <img src="imgs/scatteR.png" style="width:40%"/> .footnote[<small>Install and try it out for yourself from https://github.com/janithwanni/scatteR</small>] --- ## A simple example ```r library(scatteR) library(tidyverse) df <- scatteR(measurements = c("Monotonic" = 0.9),n_points = 200,error_var = 9) qplot(data=df,x=x,y=y) ``` <center> <img src="imgs/scatter_example_plot_1.png" style="width:50%" /> </center> --- class: center # Behind the scenes, <video width="640" height="360" controls> <source src="https://i.imgur.com/WMXKUU0.mp4" type="video/mp4"> </video> <!-- <img src="imgs/scatteR_anim.gif"/> --> The simulated annealing component is achieved through the [GenSA](https://github.com/cran/GenSA) package. .footnote[<small>Y. Xiang, et al (2013). Generalized Simulated Annealing for Efficient Global Optimization: the GenSA Package for R.</small>] --- ### The scatteR() function .center[<img src="imgs/scatteR_help.png" style="width:70%"/>] ??? Here I will be showcasing the documentation and the arguments that are available and what each of those arguments mean. --- # Performance results ### Type wise error .center[<img src="imgs/mae_plot_experiment.jpg"/>] --- ## Is there a special recipe for the hyperparameters .center[<img src="imgs/init_point_plot.jpg" style="width:90%">] ??? Here I will be talking about the effects of changing different parameters. --- # How much time does it take? .center[<img src="imgs/scag_type_value_plot.jpg" style="width:65%"/>] ??? Here I will be talking about the time complexity of scatteR and the ways to speed the generation process --- # In summary, what can scatteR do for you? ### As a teacher, * Generate dummy data for the students to try out new statistical methods * Introduce students to the concept of scagnostics -- ### As a data scientist, * Synthesize small scale numerical datasets for test purposes * Generate dummy data to try out new data science methods -- ### As an everyday R user, * Generating data for an interesting generative art made with R * Generate a quick sample of data to test out a new package that you installed --- <img src="imgs/synth_gen_art.jpeg" /> .center[A generative art based on the bivariate numeric relationships of the palmerpenguins dataset] --- # Where to from here? .pull-left[ * Better optimization methods * Parallelized implementations * Replacing the relevant R code to C++ code * and many more so stay tuned! ] .pull-right[ <img src="imgs/caterpie_moon.jpeg" style="width:140%;position:relative;top:-154px;max-width:none"/> ] --- class: center, middle <br/> # Thank you for listening! .pull-left[ ## Thank you to my supervisor <img src="https://pbs.twimg.com/profile_images/1459464211713691652/SqkWoTBr_400x400.jpg" style="border-radius:200px;width:40%"> Dr. Thiyanga Talagala Check out her work on Github [@thiyangt](https://github.com/thiyangt/) <br> ] .pull-right[ ## Thank you to useR! 2022 and sponsors <img src="https://pbs.twimg.com/profile_images/1331908205677522944/NrWYReTJ_400x400.jpg" style="border-radius:200px;width:40%"> For awarding me with the diversity scholarship that gave me the financial strength to speak before you ] --- class: center, middle <h2 style="color:#5b85aa">Have any follow up questions?</h2> Email: [janithcwanni@gmail.com](#) Twitter: [@janithcwanni](https://twitter.com/janithcwanni) Github: [@janithwanni](https://github.com/janithwanni) Linkedin: [Janith Wanniarachchi](https://www.linkedin.com/in/janith-wanniarachchi-462851117/) <hr style="width:40%"> Try scatteR at [https://github.com/janithwanni/scatteR](https://github.com/janithwanni/scatteR) Slides available at: [https://scatter-use-r-2022.netlify.app/](https://scatter-use-r-2022.netlify.app/) <br> Created with xaringan and xaringan themer --- # Acknowledgements The following content were created by the respective creators and not my work. * Image of Ice cream sprinkles (Slide #17): Photo by <a href="https://unsplash.com/@calavera?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">David Calavera</a> on <a href="https://unsplash.com/@calavera?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a> * Image of Annealing (Slide #19): [http://www.turingfinance.com/simulated-annealing-for-portfolio-optimization/](http://www.turingfinance.com/simulated-annealing-for-portfolio-optimization/) * Image of Caterpie staring at moon made by All0412 on deviantart (Slide #24) [https://www.deviantart.com/all0412/art/Caterpie-353761155](https://www.deviantart.com/all0412/art/Caterpie-353761155)