Basics
JuliaDB offers two main data structures as well as distributed counterparts. This allows you to easily scale up an analysis, as operations that work on non-distributed tables either work out of the box or are easy to transition for distributed tables.
Here is a high level overview of tables in JuliaDB:
- Tables store data in columns.
- Tables are typed.
- Changing a table in some way therefore requires returning a new table (underlying data is not copied).
- JuliaDB has few mutating operations because a new table is necessary in most cases.
Data for examples:
x = 1:10
y = vcat(fill('a', 4), fill('b', 6))
z = randn(10);
10-element Vector{Float64}:
-0.0468656624370505
-0.5433314559561914
0.11957453709749813
0.8859928959094242
-1.917342436712874
0.5597762348401842
0.30746180863333816
-1.2553543812297594
-1.519593213837593
0.5610472334945272
IndexedTable
An IndexedTable
is wrapper around a (named) tuple of Vectors, but it behaves like a Vector of (named) tuples. You can choose to sort the table by any number of primary keys (in this case columns :x
and :y
).
An IndexedTable
is created with data in Julia via the table
function or with data on disk via the loadtable
function.
julia> t = table((x=x, y=y, z=z); pkey = [:x, :y])
Table with 10 rows, 3 columns: x y z ─────────────────── 1 'a' -0.0468657 2 'a' -0.543331 3 'a' 0.119575 4 'a' 0.885993 5 'b' -1.91734 6 'b' 0.559776 7 'b' 0.307462 8 'b' -1.25535 9 'b' -1.51959 10 'b' 0.561047
julia> t[1]
(x = 1, y = 'a', z = -0.0468656624370505)
julia> t[end]
(x = 10, y = 'b', z = 0.5610472334945272)
NDSparse
An NDSparse
has a similar underlying structure to IndexedTable
, but it behaves like a sparse array with arbitrary indices. The keys of an NDSparse
are sorted, much like the primary keys of an IndexedTable
.
An NDSparse
is created with data in Julia via the ndsparse
function or with data on disk via the loadndsparse
function.
julia> nd = ndsparse((x=x, y=y), (z=z,))
2-d NDSparse with 10 values (1 field named tuples): x y │ z ────────┼─────────── 1 'a' │ -0.0468657 2 'a' │ -0.543331 3 'a' │ 0.119575 4 'a' │ 0.885993 5 'b' │ -1.91734 6 'b' │ 0.559776 7 'b' │ 0.307462 8 'b' │ -1.25535 9 'b' │ -1.51959 10 'b' │ 0.561047
julia> nd[1, 'a']
(z = -0.0468656624370505,)
julia> nd[10, 'j'].z
ERROR: KeyError: key (10, 'j') not found
julia> nd[1, :]
1-d NDSparse with 1 values (1 field named tuples): y │ z ────┼─────────── 'a' │ -0.0468657
Selectors
JuliaDB has a variety of ways to select columns. These selection methods get used across many JuliaDB's functions: select
, reduce
, groupreduce
, groupby
, join
, transform
, reindex
, and more.
To demonstrate selection, we'll use the select
function. A selection can be any of the following types:
Integer
– returns the column at this position.Symbol
– returns the column with this name.Pair{Selection => Function}
– selects and maps a function over the selection, returns the result.AbstractArray
– returns the array itself. This must be the same length as the table.Tuple
ofSelection
– returns a table containing a column for every selector in the tuple.Regex
– returns the columns with names that match the regular expression.Type
– returns columns with elements of the given type.Not(Selection)
– returns columns that are not included in the selection.Between(first, last)
– returns columns betweenfirst
andlast
.Keys()
– return the primary key columns.
t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z])
Table with 10 rows, 3 columns:
x y z
────────────────────
1 1.07085 false
2 -0.960728 false
3 -0.81665 false
4 -0.111198 false
5 2.24545 true
6 -0.811255 false
7 1.15603 true
8 0.759486 false
9 1.64535 false
10 -0.601752 false
select the :x vector
julia> select(t, 1)
10-element Vector{Int64}: 1 2 3 4 5 6 7 8 9 10
julia> select(t, :x)
10-element Vector{Int64}: 1 2 3 4 5 6 7 8 9 10
map a function to the :y vector
julia> select(t, 2 => abs)
10-element Vector{Float64}: 1.0708479992338487 0.9607276276622873 0.8166498931995329 0.11119787501241411 2.24545239704408 0.8112545684559442 1.1560277290477903 0.7594856747567902 1.645345579829251 0.6017516091385081
julia> select(t, :y => x -> x > 0 ? x : -x)
10-element Vector{Float64}: 1.0708479992338487 0.9607276276622873 0.8166498931995329 0.11119787501241411 2.24545239704408 0.8112545684559442 1.1560277290477903 0.7594856747567902 1.645345579829251 0.6017516091385081
select the table of :x and :z
julia> select(t, (:x, :z))
Table with 10 rows, 2 columns: x z ───────── 1 false 2 false 3 false 4 false 5 true 6 false 7 true 8 false 9 false 10 false
julia> select(t, r"(x|z)")
Table with 10 rows, 2 columns: x z ───────── 1 false 2 false 3 false 4 false 5 true 6 false 7 true 8 false 9 false 10 false
map a function to the table of :x and :y
julia> select(t, (:x, :y) => row -> row[1] + row[2])
10-element Vector{Float64}: 2.0708479992338487 1.0392723723377126 2.183350106800467 3.8888021249875857 7.24545239704408 5.1887454315440555 8.156027729047791 8.75948567475679 10.64534557982925 9.398248390861491
julia> select(t, (1, :y) => row -> row.x + row.y)
10-element Vector{Float64}: 2.0708479992338487 1.0392723723377126 2.183350106800467 3.8888021249875857 7.24545239704408 5.1887454315440555 8.156027729047791 8.75948567475679 10.64534557982925 9.398248390861491
select columns that are subtypes of Integer
julia> select(t, Integer)
Table with 10 rows, 2 columns: x z ───────── 1 false 2 false 3 false 4 false 5 true 6 false 7 true 8 false 9 false 10 false
select columns that are not subtypes of Integer
julia> select(t, Not(Integer))
Table with 10 rows, 1 columns: y ───────── 1.07085 -0.960728 -0.81665 -0.111198 2.24545 -0.811255 1.15603 0.759486 1.64535 -0.601752
Loading and Saving
Loading Data From CSV
Loading a CSV file (or multiple files) into one of JuliaDB's tabular data structures is accomplished via the loadtable
and loadndsparse
functions.
using JuliaDB, DelimitedFiles
x = rand(10, 2)
writedlm("temp.csv", x, ',')
t = loadtable("temp.csv")
Table with 9 rows, 2 columns:
0.3454284142279518 0.33009330171710827
───────────────────────────────────────
0.675905 0.162582
0.220677 0.197045
0.967087 0.0651113
0.0240857 0.204306
0.59852 0.521566
0.946908 0.799856
0.711715 0.60849
0.0838515 0.659101
0.994414 0.549732
loadtable
and loadndsparse
use Missing
to represent missing values. To load a CSV that instead uses DataValue
, see CSVFiles.jl. For more information on missing value representations, see Missing Values.
Converting From Other Data Structures
using JuliaDB, RDatasets
df = dataset("datasets", "iris") # load data as DataFrame
table(df) # Convert DataFrame to IndexedTable
Table with 150 rows, 5 columns:
SepalLength SepalWidth PetalLength PetalWidth Species
─────────────────────────────────────────────────────────────
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
⋮
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"
Save Table into Binary Format
A table can be saved to disk (for fast, efficient reloading) via the save
function.
Load Table from Binary Format
Tables that have been save
-ed can be loaded efficiently via load
.