Basics
JuliaDB offers two main data structures as well as distributed counterparts. This allows you to easily scale up an analysis, as operations that work on non-distributed tables either work out of the box or are easy to transition for distributed tables.
Here is a high level overview of tables in JuliaDB:
- Tables store data in columns.
- Tables are typed.
- Changing a table in some way therefore requires returning a new table (underlying data is not copied).
- JuliaDB has few mutating operations because a new table is necessary in most cases.
Data for examples:
x = 1:10
y = vcat(fill('a', 4), fill('b', 6))
z = randn(10);
10-element Vector{Float64}:
-0.10159561356279523
-1.0497669443355508
0.1387817536928029
1.3656003856430101
1.2117421702412088
-0.24526010252898056
-0.9121230433477152
1.9997213420959465
-0.1390732680297047
-1.4708041785695336
IndexedTable
An IndexedTable
is wrapper around a (named) tuple of Vectors, but it behaves like a Vector of (named) tuples. You can choose to sort the table by any number of primary keys (in this case columns :x
and :y
).
An IndexedTable
is created with data in Julia via the table
function or with data on disk via the loadtable
function.
julia> t = table((x=x, y=y, z=z); pkey = [:x, :y])
Table with 10 rows, 3 columns: x y z ────────────────── 1 'a' -0.101596 2 'a' -1.04977 3 'a' 0.138782 4 'a' 1.3656 5 'b' 1.21174 6 'b' -0.24526 7 'b' -0.912123 8 'b' 1.99972 9 'b' -0.139073 10 'b' -1.4708
julia> t[1]
(x = 1, y = 'a', z = -0.10159561356279523)
julia> t[end]
(x = 10, y = 'b', z = -1.4708041785695336)
NDSparse
An NDSparse
has a similar underlying structure to IndexedTable
, but it behaves like a sparse array with arbitrary indices. The keys of an NDSparse
are sorted, much like the primary keys of an IndexedTable
.
An NDSparse
is created with data in Julia via the ndsparse
function or with data on disk via the loadndsparse
function.
julia> nd = ndsparse((x=x, y=y), (z=z,))
2-d NDSparse with 10 values (1 field named tuples): x y │ z ────────┼────────── 1 'a' │ -0.101596 2 'a' │ -1.04977 3 'a' │ 0.138782 4 'a' │ 1.3656 5 'b' │ 1.21174 6 'b' │ -0.24526 7 'b' │ -0.912123 8 'b' │ 1.99972 9 'b' │ -0.139073 10 'b' │ -1.4708
julia> nd[1, 'a']
(z = -0.10159561356279523,)
julia> nd[10, 'j'].z
ERROR: KeyError: key (10, 'j') not found
julia> nd[1, :]
1-d NDSparse with 1 values (1 field named tuples): y │ z ────┼────────── 'a' │ -0.101596
Selectors
JuliaDB has a variety of ways to select columns. These selection methods get used across many JuliaDB's functions: select
, reduce
, groupreduce
, groupby
, join
, transform
, reindex
, and more.
To demonstrate selection, we'll use the select
function. A selection can be any of the following types:
Integer
– returns the column at this position.Symbol
– returns the column with this name.Pair{Selection => Function}
– selects and maps a function over the selection, returns the result.AbstractArray
– returns the array itself. This must be the same length as the table.Tuple
ofSelection
– returns a table containing a column for every selector in the tuple.Regex
– returns the columns with names that match the regular expression.Type
– returns columns with elements of the given type.Not(Selection)
– returns columns that are not included in the selection.Between(first, last)
– returns columns betweenfirst
andlast
.Keys()
– return the primary key columns.
t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z])
Table with 10 rows, 3 columns:
x y z
────────────────────
1 0.851627 true
2 0.103969 false
3 0.39099 true
4 1.38581 false
5 -0.785436 false
6 -0.533566 true
7 -0.280398 false
8 -0.534114 true
9 1.29287 false
10 0.913551 true
select the :x vector
julia> select(t, 1)
10-element Vector{Int64}: 1 2 3 4 5 6 7 8 9 10
julia> select(t, :x)
10-element Vector{Int64}: 1 2 3 4 5 6 7 8 9 10
map a function to the :y vector
julia> select(t, 2 => abs)
10-element Vector{Float64}: 0.8516274757005192 0.10396917553070863 0.3909902528311987 1.3858137585208943 0.7854356796301191 0.5335662167458792 0.28039802806173053 0.5341137332818543 1.2928674005903655 0.9135514878002643
julia> select(t, :y => x -> x > 0 ? x : -x)
10-element Vector{Float64}: 0.8516274757005192 0.10396917553070863 0.3909902528311987 1.3858137585208943 0.7854356796301191 0.5335662167458792 0.28039802806173053 0.5341137332818543 1.2928674005903655 0.9135514878002643
select the table of :x and :z
julia> select(t, (:x, :z))
Table with 10 rows, 2 columns: x z ───────── 1 true 2 false 3 true 4 false 5 false 6 true 7 false 8 true 9 false 10 true
julia> select(t, r"(x|z)")
Table with 10 rows, 2 columns: x z ───────── 1 true 2 false 3 true 4 false 5 false 6 true 7 false 8 true 9 false 10 true
map a function to the table of :x and :y
julia> select(t, (:x, :y) => row -> row[1] + row[2])
10-element Vector{Float64}: 1.851627475700519 2.103969175530709 3.3909902528311986 5.385813758520895 4.214564320369881 5.4664337832541205 6.719601971938269 7.465886266718146 10.292867400590366 10.913551487800264
julia> select(t, (1, :y) => row -> row.x + row.y)
10-element Vector{Float64}: 1.851627475700519 2.103969175530709 3.3909902528311986 5.385813758520895 4.214564320369881 5.4664337832541205 6.719601971938269 7.465886266718146 10.292867400590366 10.913551487800264
select columns that are subtypes of Integer
julia> select(t, Integer)
Table with 10 rows, 2 columns: x z ───────── 1 true 2 false 3 true 4 false 5 false 6 true 7 false 8 true 9 false 10 true
select columns that are not subtypes of Integer
julia> select(t, Not(Integer))
Table with 10 rows, 1 columns: y ───────── 0.851627 0.103969 0.39099 1.38581 -0.785436 -0.533566 -0.280398 -0.534114 1.29287 0.913551
Loading and Saving
Loading Data From CSV
Loading a CSV file (or multiple files) into one of JuliaDB's tabular data structures is accomplished via the loadtable
and loadndsparse
functions.
using JuliaDB, DelimitedFiles
x = rand(10, 2)
writedlm("temp.csv", x, ',')
t = loadtable("temp.csv")
Table with 9 rows, 2 columns:
0.44472015615101446 0.34463697772446045
────────────────────────────────────────
0.994759 0.116042
0.297038 0.213936
0.922352 0.819499
0.199009 0.811964
0.73828 0.243236
0.250093 0.591472
0.933499 0.70859
0.554243 0.859971
0.126187 0.975411
loadtable
and loadndsparse
use Missing
to represent missing values. To load a CSV that instead uses DataValue
, see CSVFiles.jl. For more information on missing value representations, see Missing Values.
Converting From Other Data Structures
using JuliaDB, RDatasets
df = dataset("datasets", "iris") # load data as DataFrame
table(df) # Convert DataFrame to IndexedTable
Table with 150 rows, 5 columns:
SepalLength SepalWidth PetalLength PetalWidth Species
─────────────────────────────────────────────────────────────
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
⋮
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"
Save Table into Binary Format
A table can be saved to disk (for fast, efficient reloading) via the save
function.
Load Table from Binary Format
Tables that have been save
-ed can be loaded efficiently via load
.