OnlineStats Integration
OnlineStats is a package for calculating statistics and models with online (one observation at a time), parallelizable algorithms. It integrates tightly with JuliaDB's distributed data structures to calculate statistics on large datasets. The full documentation for OnlineStats is available here.
Basics
OnlineStats objects can be updated with more data and merged together. The image below demonstrates what goes on under the hood in JuliaDB to compute a statistic s in parallel.
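The fit-then-merge pattern described above can be sketched with OnlineStats alone. This is a minimal illustration, not JuliaDB's actual implementation: the `data` vector and its two `chunks` are hypothetical stand-ins for the partitions of a distributed table.

```julia
using OnlineStats

# Hypothetical data, split into two chunks that stand in for distributed partitions.
data = collect(1.0:100.0)
chunks = [data[1:50], data[51:100]]

# Fit an independent statistic on each chunk (this step could run in parallel).
partials = [fit!(Mean(), chunk) for chunk in chunks]

# Merge the partial results into a single statistic.
s = reduce(merge!, partials)

println(value(s))  # the mean over all of `data`
println(nobs(s))   # total number of observations seen
```

Because merging is cheap, each worker only needs to send its small fitted statistic rather than its share of the data.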
OnlineStats integration is available via the reduce and groupreduce functions. An OnlineStat acts differently from a normal reducer:
- Normal reducer f: val = f(val, row)
- OnlineStat reducer o: fit!(o, row)
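The difference between the two reducer styles can be sketched outside JuliaDB as follows. The `rows` vector and the function `f` are hypothetical, chosen only to illustrate the contrast: a normal reducer returns a new accumulated value at each step, while an OnlineStat is mutated in place by fit!.

```julia
using OnlineStats

# Hypothetical "rows": plain numbers for simplicity.
rows = [2.0, 4.0, 6.0]

# Normal reducer: f takes the accumulated value and a row, and returns a new value.
f(val, row) = val + row
total = foldl(f, rows; init = 0.0)

# OnlineStat reducer: fit! updates the statistic o in place, one row at a time.
o = Sum()
for row in rows
    fit!(o, row)
end

println(total)     # running total from the normal reducer
println(value(o))  # the same total, held inside the OnlineStat
```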
julia> using JuliaDB, OnlineStats
julia> t = table(1:100, rand(Bool, 100), randn(100));
julia> reduce(Mean(), t; select = 3)
Mean: n=100 | value=-0.0693309
julia> grp = groupreduce(Mean(), t, 2; select=3)
Table with 2 rows, 2 columns:
1      2
───────────────────────────────────
false  Mean: n=54 | value=-0.227742
true   Mean: n=46 | value=0.11663
julia> select(grp, (1, 2 => value))
Table with 2 rows, 2 columns:
1      2
────────────────
false  -0.227742
true   0.11663
The OnlineStats.value function extracts the value of a statistic, e.g. value(Mean()).
Calculating Statistics on Multiple Columns
The OnlineStats.Group type is used for calculating statistics on multiple data streams, such as the columns of a table. A Group that computes the same OnlineStat for every stream can be created through integer multiplication:
julia> reduce(3Mean(), t)
Group
├─ Mean: n=100 | value=50.5
├─ Mean: n=100 | value=0.46
└─ Mean: n=100 | value=-0.0693309
Alternatively, a Group can be created by providing a collection of OnlineStats.
julia> reduce(Group(Extrema(Int), CountMap(Bool), Mean()), t)
Group
├─ Extrema: n=100 | value=(min = 1, max = 100, nmin = 1, nmax = 1)
├─ CountMap: n=100 | value=OrderedCollections.OrderedDict{Bool, Int64}(1=>46, 0=>54)
└─ Mean: n=100 | value=-0.0693309
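Both ways of building a Group can also be tried without a table. The sketch below assumes that a Group accepts each observation as a tuple with one entry per statistic, and that its component statistics are stored in a `stats` field. The `rows` vector is hypothetical data, not the table t used above.

```julia
using OnlineStats

# Hypothetical rows: each observation is a tuple with one entry per statistic.
rows = [(1, true, 0.5), (2, false, 1.5), (3, true, 2.5)]

# Same statistic for every position, built by integer multiplication.
g_same = fit!(3Mean(), rows)

# A different statistic for each position.
g_mixed = fit!(Group(Extrema(Int), CountMap(Bool), Mean()), rows)

# Pull the individual results out of each group.
means = collect(map(value, g_same.stats))
counts = value(g_mixed.stats[2])

println(means)   # one mean per tuple position
println(counts)  # occurrences of true and false in the second position
```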