Skip to contents

Introduction

newdata is an R package to generate new data frames for predictive purposes. By default, all specified variables vary across their range while all other variables are held constant at the default reference value. Types, classes, factor levels and time zones are always preserved. The user can specify the length of each sequence, require that only observed values and combinations are used and add new variables.

Consider the following observed data frame.

library(newdata)

obs_data
#> # A tibble: 3 × 9
#>   lgl     int   dbl chr      fct     ord   dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr>    <fct>   <ord> <date>     <dttm>              <time>
#> 1 TRUE      1   1   most     most    most  1970-01-02 1969-12-31 16:00:01 00'01"
#> 2 FALSE     4   4.5 most     most    most  1970-01-05 1969-12-31 16:00:04 00'04"
#> 3 NA        6   8.2 a rarity a rari… a ra… 1970-01-07 1969-12-31 16:00:06 00'06"

Length of Sequences

By default all variables are held constant (length of 1).

xnew_data(obs_data)
#> # A tibble: 1 × 9
#>   lgl     int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1 FALSE     3  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"

Specifying a variable causes it to vary sequentially across its range.

xnew_data(obs_data, int)
#> # A tibble: 6 × 9
#>   lgl     int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1 FALSE     1  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2 FALSE     2  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 3 FALSE     3  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 4 FALSE     4  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 5 FALSE     5  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 6 FALSE     6  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"

The user can specify the length of each sequence.

xnew_data(obs_data, xnew_seq(int, length_out = 3))
#> # A tibble: 3 × 9
#>   lgl     int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1 FALSE     1  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2 FALSE     3  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 3 FALSE     6  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"

Observed Values and Combinations

The user can also indicate whether only observed values should be used in the sequence.

xnew_data(obs_data, xnew_seq(int, length_out = 3, obs_only = TRUE))
#> # A tibble: 3 × 9
#>   lgl     int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1 FALSE     1  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2 FALSE     4  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 3 FALSE     6  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"

The xobs_only() function can be used to filter out unobserved values after the sequence has been generated.

xnew_data(obs_data, xobs_only(xnew_seq(int, length_out = 3)))
#> # A tibble: 2 × 9
#>   lgl     int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1 FALSE     1  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2 FALSE     6  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"

With two or more variables all combinations are used.

xnew_data(obs_data, int, fct)
#> # A tibble: 18 × 9
#>    lgl     int   dbl chr   fct      ord    dte        dtt                 hms   
#>    <lgl> <int> <dbl> <chr> <fct>    <ord>  <date>     <dttm>              <time>
#>  1 FALSE     1  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  2 FALSE     1  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  3 FALSE     1  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  4 FALSE     2  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  5 FALSE     2  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  6 FALSE     2  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  7 FALSE     3  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  8 FALSE     3  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  9 FALSE     3  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 10 FALSE     4  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 11 FALSE     4  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 12 FALSE     4  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 13 FALSE     5  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 14 FALSE     5  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 15 FALSE     5  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 16 FALSE     6  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 17 FALSE     6  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 18 FALSE     6  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"

To only get observed combinations use xobs_only()

xnew_data(obs_data, xobs_only(int, fct))
#> # A tibble: 3 × 9
#>   lgl     int   dbl chr   fct      ord     dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>    <ord>   <date>     <dttm>              <time>
#> 1 FALSE     1  4.57 most  most     a rari… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2 FALSE     4  4.57 most  most     a rari… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 3 FALSE     6  4.57 most  a rarity a rari… 1970-01-04 1969-12-31 16:00:03 00'03"

Add New Variables

Adding a new variable is simple.

xnew_data(obs_data, new = c(TRUE, FALSE))
#> # A tibble: 2 × 10
#>   lgl     int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1 FALSE     3  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2 FALSE     3  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> # ℹ 1 more variable: new <lgl>

Casting Variables

Casting variables is easy.

xnew_data(obs_data, xcast(int = 7, dbl = 10L, fct = "a rarity"))
#> # A tibble: 1 × 9
#>   lgl     int   dbl chr   fct      ord     dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>    <ord>   <date>     <dttm>              <time>
#> 1 FALSE     7    10 most  a rarity a rari… 1970-01-04 1969-12-31 16:00:03 00'03"

Installation

To install the latest release version from CRAN.

install.packages("newdata")

To install the latest development version from r-universe.

install.packages("newdata", repos = c("https://poissonconsulting.r-universe.dev", "https://cloud.r-project.org"))

To install the latest development version from GitHub

# install.packages("pak", repos = sprintf("https://r-lib.github.io/p/pak/stable/%s/%s/%s", .Platform$pkgType, R.Version()$os, R.Version()$arch))
pak::pak("poissonconsulting/newdata")

Contribution

Please report any issues.

Pull requests are always welcome.

Code of Conduct

Please note that the newdata project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.