API

JuliaDB.JuliaDB
IndexedTables.ColDict
IndexedTables.IndexedTable
IndexedTables.Keys
IndexedTables.NDSparse
IndexedTables.Perm
JuliaDB.DIndexedTable
JuliaDB.DNDSparse
JuliaDB.IndexSpace
JuliaDB.Interval
Base.Broadcast.broadcast
Base.collect
Base.convert
Base.filter
Base.getindex
Base.join
Base.keys
Base.length
Base.map
Base.map
Base.map
Base.merge
Base.pairs
Base.reduce
Base.reduce
Base.sort
Base.sort!
Base.values
Dagger.compute
Dagger.distribute
Dagger.distribute
Dagger.distribute
Dagger.load
Dagger.save
Dagger.save
IndexedTables.aggregate!
IndexedTables.arrayof
IndexedTables.asofjoin
IndexedTables.best_perm_estimate
IndexedTables.collect_columns
IndexedTables.colnames
IndexedTables.columns
IndexedTables.columns
IndexedTables.convertdim
IndexedTables.convertdim
IndexedTables.convertmissing
IndexedTables.dimlabels
IndexedTables.dropmissing
IndexedTables.excludecols
IndexedTables.flatten
IndexedTables.flush!
IndexedTables.groupby
IndexedTables.groupjoin
IndexedTables.groupreduce
IndexedTables.insertcols
IndexedTables.insertcolsafter
IndexedTables.insertcolsbefore
IndexedTables.leftjoin
IndexedTables.map_rows
IndexedTables.naturaljoin
IndexedTables.naturaljoin
IndexedTables.ncols
IndexedTables.ndsparse
IndexedTables.pkeynames
IndexedTables.pkeynames
IndexedTables.pkeys
IndexedTables.reducedim_vec
IndexedTables.reducedim_vec
IndexedTables.reindex
IndexedTables.rename
IndexedTables.rows
IndexedTables.select
IndexedTables.selectkeys
IndexedTables.selectvalues
IndexedTables.stack
IndexedTables.summarize
IndexedTables.table
IndexedTables.unstack
IndexedTables.update!
IndexedTables.where
JuliaDB.fromchunks
JuliaDB.loadndsparse
JuliaDB.loadtable
JuliaDB.mapchunks
JuliaDB.partitionplot
JuliaDB.rechunk
JuliaDB.tracktime
StatsBase.transform

JuliaDB.JuliaDB — Module

JuliaDB

JuliaDB is a package for working with large persistent data sets.

JuliaDB is an all-Julia, end-to-end tool that can

Load multi-dimensional datasets quickly and incrementally
Index the data and perform filter, aggregate, sort, and join operations
Save results and load them efficiently later
Use Julia's built-in parallelism to fully utilize any machine or cluster

Introduce yourself to JuliaDB's features at juliadb.org or jump into the documentation here!

source

JuliaDB.DIndexedTable — Type

A distributed table

source

JuliaDB.DNDSparse — Type

DNDSparse{K,V} <: AbstractNDSparse

A distributed NDSparse datastructure. Can be constructed by:

ndsparse from Julia objects
loadndsparse from data on disk
distribute from an NDSparse object

source

JuliaDB.IndexSpace — Type

IndexSpace(interval, boundingrect, nrows)

Metadata about an chunk.

interval: An Interval object with the first and the last index tuples.
boundingrect: An Interval object with the lowest and the highest indices as tuples.
nrows: A Nullable{Int} of number of rows in the NDSparse, if knowable.

source

JuliaDB.Interval — Type

An interval type tailored specifically to store intervals of indices of an NDSparse object. Some of the operations on this like in or < may be controversial for a generic Interval type.

source

Base.collect — Method

collect(t::DNDSparse)

Gets distributed data in a DNDSparse t and merges it into NDSparse object

Warning

collect(t) requires at least as much memory as the size of the result of the computing t. If the result is expected to be big, try compute(save(t, "output_dir")) instead. See save for more. This data can be loaded later using load.

source

Base.getindex — Method

t[idx...]

Returns a DNDSparse containing only the elements of t where the given indices (idx) match. If idx has the same type as the index tuple of the t, then this is considered a scalar indexing (indexing of a single value). In this case the value itself is looked up and returned.

source

Base.length — Method

The length of the DNDSparse if it can be computed. Will throw an error if not. You can get the length of such tables after calling compute on them.

source

Base.map — Method

map(f, t::DNDSparse)

Applies a function f on every element in the data of table t.

source

Dagger.compute — Method

compute(t::DNDSparse; allowoverlap, closed)

Computes any delayed-evaluations in the DNDSparse. The computed data is left on the worker processes. Subsequent operations on the results will reuse the chunks.

If allowoverlap is false then the computed data is re-sorted if required to have no chunks with overlapping index ranges if necessary.

If closed is true then the computed data is re-sorted if required to have no chunks with overlapping OR continuous boundaries.

See also reducedim.

source

JuliaDB.fromchunks — Method

fromchunks(cs)

Construct a distributed object from chunks. Calls fromchunks(T, cs) where T is the type of the data in the first chunk. Computes any thunks.

source

JuliaDB.loadndsparse — Method

loadndsparse(files::Union{AbstractVector,String}; <options>)

Load an NDSparse from CSV files.

files is either a vector of file paths, or a directory name.

Options:

indexcols::Vector – columns to use as indexed columns. (by default a 1:n implicit index is used.)
datacols::Vector – non-indexed columns. (defaults to all columns but indexed columns). Specify this to only load a subset of columns. In place of the name of a column, you can specify a tuple of names – this will treat any column with one of those names as the same column, but use the first name in the tuple. This is useful when the same column changes name between CSV files. (e.g. vendor_id and VendorId)

All other options are identical to those in loadtable

source

JuliaDB.loadtable — Method

loadtable(files::Union{AbstractVector,String}; <options>)

Load a table from CSV files.

files is either a vector of file paths, or a directory name.

Options:

output::AbstractString – directory name to write the table to. By default data is loaded directly to memory. Specifying this option will allow you to load data larger than the available memory.
indexcols::Vector – columns to use as primary key columns. (defaults to [])
datacols::Vector – non-indexed columns. (defaults to all columns but indexed columns). Specify this to only load a subset of columns. In place of the name of a column, you can specify a tuple of names – this will treat any column with one of those names as the same column, but use the first name in the tuple. This is useful when the same column changes name between CSV files. (e.g. vendor_id and VendorId)
distributed::Bool – should the output dataset be loaded as a distributed table? If true, this will use all available worker processes to load the data. (defaults to true if workers are available, false if not)
chunks::Int – number of chunks to create when loading distributed. (defaults to number of workers)
delim::Char – the delimiter character. (defaults to ,). Use spacedelim=true to split by spaces.
spacedelim::Bool: parse space-delimited files. delim has no effect if true.
quotechar::Char – quote character. (defaults to ")
escapechar::Char – escape character. (defaults to ")
filenamecol::Union{Symbol, Pair} – create a column containing the file names from where each row came from. This argument gives a name to the column. By default, basename(name) of the name is kept, and ".csv" suffix will be stripped. To provide a custom function to apply on the names, use a name => Function pair. By default, no file name column will be created.
header_exists::Bool – does header exist in the files? (defaults to true)
colnames::Vector{String} – specify column names for the files, use this with (header_exists=false, otherwise first row is discarded). By default column names are assumed to be present in the file.
samecols – a vector of tuples of strings where each tuple contains alternative names for the same column. For example, if some files have the name "vendorid" and others have the name "VendorID", pass `samecols=[("VendorID", "vendorid")]`.
colparsers – either a vector or dictionary of data types or an AbstractToken object from TextParse package. By default, these are inferred automatically. See type_detect_rows option below.
type_detect_rows: number of rows to use to infer the initial colparsers defaults to 20.
nastrings::Vector{String} – strings that are to be considered missing values. (defaults to TextParse.NA_STRINGS)
skiplines_begin::Char – skip some lines in the beginning of each file. (doesn't skip by default)
usecache::Bool: (vestigial)

source

JuliaDB.mapchunks — Method

mapchunks(f, t::DNDSparse; keeplengths=true)

Applies a function to each chunk in t. Returns a new DNDSparse. If keeplength is false, this means that the lengths of the output chunks is unknown before compute. This function is used internally by many DNDSparse operations.

source

JuliaDB.partitionplot — Function

partitionplot(table, y;    stat=Extrema(), nparts=100, by=nothing, dropmissing=false)
partitionplot(table, x, y; stat=Extrema(), nparts=100, by=nothing, dropmissing=false)

Plot a summary of variable y against x (1:length(y) if not specified). Using nparts approximately-equal sections along the x-axis, the data in y over each section is summarized by stat.

source

JuliaDB.rechunk — Function

rechunk(t::Union{DNDSparse, DNDSparse}[, by[, select]]; <options>)

Reindex and sort a distributed dataset by keys selected by by.

Optionally select specifies which non-indexed fields are kept. By default this is all fields not mentioned in by for Table and the value columns for NDSparse.

Options:

chunks – how to distribute the data. This can be:
1. An integer – number of chunks to create
2. An vector of k integers – number of elements in each of the k chunks. sum(k) must be same as length(t)
3. The distribution of another array. i.e. vec.subdomains where vec is a distributed array.
merge::Function – a function which merges two sub-table or sub-ndsparse into one NDSparse. They may have overlaps in their indices.
splitters::AbstractVector – specify keys to split by. To create n chunks you would need to pass n-1 splitters and also the chunks=n option.
chunks_sorted::Bool – are the chunks sorted locally? If true, this skips sorting or re-indexing them.
affinities::Vector{<:Integer} – which processes (Int pid) should each output chunk be created on. If unspecified all workers are used.
closed::Bool – if true, the same key will not be present in multiple chunks (although sorted). true by default.
nsamples::Integer – number of keys to randomly sample from each chunk to estimate splitters in the sorting process. (See samplesort). Defaults to 2000.
batchsize::Integer – how many chunks at a time from the input should be loaded into memory at any given time. This will essentially sort in batches of batchsize chunks.

source

JuliaDB.tracktime — Method

tracktime(f)

Track the time spent on different processes in different categories in running f.

source

IndexedTables.ColDict — Method

d = ColDict(t)

Create a mutable dictionary of columns in t.

To get the immutable iterator of the same type as t call d[]

IndexedTables.IndexedTable — Type

A tabular data structure that extends Columns. Create an IndexedTable with the table function.

IndexedTables.Keys — Type

Keys()

Select the primary keys.

Examples

t = table([1,1,2,2], [1,2,1,2], [1,2,3,4], names=[:a,:b,:c], pkey = (:a, :b))
select(t, Keys())

IndexedTables.NDSparse — Method

NDSparse(columns...; names=Symbol[...], kwargs...)

Construct an NDSparse array from columns. The last argument is the data column, and the rest are index columns. The names keyword argument optionally specifies names for the index columns (dimensions).

IndexedTables.Perm — Type

A permutation

Fields:

columns: The columns being indexed as a vector of integers (column numbers)
perm: the permutation - an array or iterator which has the sorted permutation

Base.Broadcast.broadcast — Method

broadcast(f, A::NDSparse, B::NDSparse; dimmap::Tuple{Vararg{Int}})
A .* B

Compute an inner join of A and B using function f, where the dimensions of B are a subset of the dimensions of A. Values from B are repeated over the extra dimensions.

dimmap optionally specifies how dimensions of A correspond to dimensions of B. It is a tuple where dimmap[i]==j means the ith dimension of A matches the jth dimension of B. Extra dimensions that do not match any dimensions of j should have dimmap[i]==0.

If dimmap is not specified, it is determined automatically using index column names and types.

Example

a = ndsparse(([1,1,2,2], [1,2,1,2]), [1,2,3,4])
b = ndsparse([1,2], [1/1, 1/2])
broadcast(*, a, b)

dimmap maps dimensions that should be broadcasted:

broadcast(*, a, b, dimmap=(0,1))

Base.convert — Method

convert(IndexedTable, pkeys, vals; kwargs...)

Construct a table with pkeys as primary keys and vals as corresponding non-indexed items. keyword arguments will be forwarded to table constructor.

Example

convert(IndexedTable, Columns(x=[1,2],y=[3,4]), Columns(z=[1,2]), presorted=true)

Base.filter — Method

filter(f, t::Union{IndexedTable, NDSparse}; select)

Iterate over t and Return the rows for which f(row) returns true. select determines the rows that are given as arguments to f (see select).

f can also be a tuple of column => function pairs. Returned rows will be those for which all conditions are true.

Example

# filter iterates over ROWS of a IndexedTable
t = table(rand(100), rand(100), rand(100), names = [:x, :y, :z])
filter(r -> r.x + r.y + r.z < 1, t)

# filter iterates over VALUES of an NDSparse
x = ndsparse(1:100, randn(100))
filter(val -> val > 0, x)

Base.join — Method

join(left, right; kw...)
join(f, left, right; kw...)

Join tables left and right.

If a function f(leftrow, rightrow) is provided, the returned table will have a single output column. See the Examples below.

If the same key occurs multiple times in either table, each left row will get matched with each right row, resulting in n_occurrences_left * n_occurrences_right output rows.

Options (keyword arguments)

how = :inner
- Join method to use. Described below.
lkey = pkeys(left)
- Fields from left to match on (see pkeys).
rkey = pkeys(right)
- Fields from right to match on.
lselect = Not(lkey)
- Output columns from left (see Not)
rselect = Not(rkey)
- Output columns from right.
missingtype = Missing
- Type of missing values that can be created through :left and :outer joins.
- Other supported option is DataValue.

Join methods (how = :inner)

:inner – rows with matching keys in both tables
:left – all rows from left, plus matched rows from right (missing values can occur)
:outer – all rows from both tables (missing values can occur)
:anti – rows in left WITHOUT matching keys in right

Examples

a = table((x = 1:10,   y = rand(10)), pkey = :x)
b = table((x = 1:2:20, z = rand(10)), pkey = :x)

join(a, b; how = :inner)
join(a, b; how = :left)
join(a, b; how = :outer)
join(a, b; how = :anti)

join((l, r) -> l.y + r.z, a, b)

Base.keys — Method

keys(x::NDSparse[, select::Selection])

Get the keys of an NDSparse object. Same as rows but acts only on the index columns of the NDSparse.

Base.map — Method

map(f, t::IndexedTable; select)

Apply f to every item in t selected by select (see also the select function). Returns a new table if f returns a tuple or named tuple. If not, returns a vector.

Examples

t = table([1,2], [3,4], names=[:x, :y])

polar = map(p -> (r = hypot(p.x, p.y), θ = atan(p.y, p.x)), t)

back2t = map(p -> (x = p.r * cos(p.θ), y = p.r * sin(p.θ)), polar)

Base.map — Method

map(f, x::NDSparse; select = values(x))

Apply f to every value of select selected from x (see select).

Apply f to every data value in x. select selects fields passed to f. By default, the data values are selected.

If the return value of f is a tuple or named tuple the result will contain many data columns.

Examples

x = ndsparse((t=[0.01, 0.05],), (x=[1,2], y=[3,4]))

polar = map(row -> (r = hypot(row.x, row.y), θ = atan(row.y, row.x)), x)

back2x = map(row -> (x = row.r * cos(row.θ), y = row.r * sin(row.θ)), polar)

Base.merge — Method

merge(a::IndexedTable, b::IndexedTable; pkey)

Merge rows of a with rows of b and remain ordered by the primary key(s). a and b must have the same column names.

merge(a::NDSparse, a::NDSparse; agg)

Merge rows of a with rows of b. To keep unique keys, the value from b takes priority. A provided function agg will aggregate values from a and b that have the same key(s).

Example:

a = table((x = 1:5, y = rand(5)); pkey = :x)
b = table((x = 6:10, y = rand(5)); pkey = :x)
merge(a, b)

a = ndsparse([1,3,5], [1,2,3])
b = ndsparse([2,3,4], [4,5,6])
merge(a, b)
merge(a, b; agg = (x,y) -> x)

Base.pairs — Method

pairs(arr::NDSparse, indices...)

Similar to where, but returns an iterator giving index=>value pairs. index will be a tuple.

Base.reduce — Method

reduce(f, t::IndexedTable; select::Selection)

Apply reducer function f pair-wise to the selection select in t. The reducer f can be:

A function
An OnlineStat
A (named) tuple of functions and/or OnlineStats
A (named) tuple of (selector => function) or (selector => OnlineStat) pairs

Examples

t = table(1:5, 6:10, names = [:t, :x])

reduce(+, t, select = :t)
reduce((a, b) -> (t = a.t + b.t, x = a.x + b.x), t)

using OnlineStats
reduce(Mean(), t, select = :t)
reduce((Mean(), Variance()), t, select = :t)

y = reduce((min, max), t, select=:x)
reduce((sum = +, prod = *), t, select=:x)

# combining reduction and selection
reduce((xsum = :x => +, negtsum = (:t => -) => +), t)

Base.reduce — Method

reduce(f, x::NDSparse; dims)

Drop the dims dimension(s) and aggregate values with f.

x = ndsparse((x=[1,1,1,2,2,2],
              y=[1,2,2,1,2,2],
              z=[1,1,2,1,1,2]), [1,2,3,4,5,6])

reduce(+, x; dims=1)
reduce(+, x; dims=(1,3))

Base.sort! — Method

sort!(t    ; kw...)
sort!(t, by; kw...)

Sort rows of t by by in place. All of Base.sort keyword arguments can be used.

Examples

t = table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6], names=[:x,:y,:z]);
sort!(t, :z, rev = true)
t

Base.sort — Method

sort(t    ; select, kw...)
sort(t, by; select, kw...)

Sort rows by by. All of Base.sort keyword arguments can be used.

Examples

t=table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6],
sort(t, :z; select = (:y, :z), rev = true)

Base.values — Method

values(x::NDSparse[, select::Selection])

Get the values of an NDSparse object. Same as rows but acts only on the value columns of the NDSparse.

IndexedTables.aggregate! — Method

aggregate!(f::Function, arr::NDSparse)

Combine adjacent rows with equal indices using the given 2-argument reduction function, in place.

IndexedTables.arrayof — Method

arrayof(T)

Returns the type of Columns or Vector suitable to store values of type T. Nested tuples beget nested Columns.

IndexedTables.asofjoin — Method

asofjoin(left::NDSparse, right::NDSparse)

Join rows from left with the "most recent" value from right.

Example

using Dates
akey1 = ["A", "A", "B", "B"]
akey2 = [Date(2017,11,11), Date(2017,11,12), Date(2017,11,11), Date(2017,11,12)]
avals = collect(1:4)

bkey1 = ["A", "A", "B", "B"]
bkey2 = [Date(2017,11,12), Date(2017,11,13), Date(2017,11,10), Date(2017,11,13)]
bvals = collect(5:8)

a = ndsparse((akey1, akey2), avals)
b = ndsparse((bkey1, bkey2), bvals)

asofjoin(a, b)

IndexedTables.best_perm_estimate — Method

Returns: (n, perm) where n is the number of columns in the beginning of cols, perm is one possible permutation of those first n columns.

IndexedTables.collect_columns — Method

collect_columns(itr)

Collect an iterable as a Columns object if it iterates Tuples or NamedTuples, as a normal Array otherwise.

Examples

s = [(1,2), (3,4)]
collect_columns(s)

s2 = Iterators.filter(isodd, 1:8)
collect_columns(s2)

IndexedTables.colnames — Function

colnames(itr)

Returns the names of the "columns" in itr.

Examples:

colnames(1:3)
colnames(Columns([1,2,3], [3,4,5]))
colnames(table([1,2,3], [3,4,5]))
colnames(Columns(x=[1,2,3], y=[3,4,5]))
colnames(table([1,2,3], [3,4,5], names=[:x,:y]))
colnames(ndsparse(Columns(x=[1,2,3]), Columns(y=[3,4,5])))
colnames(ndsparse(Columns(x=[1,2,3]), [3,4,5]))
colnames(ndsparse(Columns(x=[1,2,3]), [3,4,5]))
colnames(ndsparse(Columns([1,2,3], [4,5,6]), Columns(x=[6,7,8])))
colnames(ndsparse(Columns(x=[1,2,3]), Columns([3,4,5],[6,7,8])))

IndexedTables.columns — Function

columns(itr, select::Selection = All())

Select one or more columns from an iterable of rows as a tuple of vectors.

select specifies which columns to select. Refer to the select function for the available selection options and syntax.

itr can be NDSparse, Columns, AbstractVector, or their distributed counterparts.

Examples

t = table(1:2, 3:4; names = [:x, :y])

columns(t)
columns(t, :x)
columns(t, (:x,))
columns(t, (:y, :x => -))

IndexedTables.columns — Method

columns(itr, which)

Returns a vector or a tuple of vectors from the iterator.

IndexedTables.convertdim — Method

convertdim(x::NDSparse, d::DimName, xlate; agg::Function, vecagg::Function, name)

Apply function or dictionary xlate to each index in the specified dimension. If the mapping is many-to-one, agg or vecagg is used to aggregate the results. If agg is passed, it is used as a 2-argument reduction function over the data. If vecagg is passed, it is used as a vector-to-scalar function to aggregate the data. name optionally specifies a new name for the translated dimension.

IndexedTables.convertmissing — Method

convertmissing(tbl, missingtype)

Convert the missing value representation in tbl to be of type missingtype.

Example

using IndexedTables, DataValues
t = table([1,2,missing], [1,missing,3])
IndexedTables.convertmissing(t, DataValue)

IndexedTables.dimlabels — Method

dimlabels(t::NDSparse)

Returns an array of integers or symbols giving the labels for the dimensions of t. ndims(t) == length(dimlabels(t)).

IndexedTables.dropmissing — Function

dropmissing(t        )
dropmissing(t, select)

Drop rows of table t which contain missing values (either Missing or DataValue), optionally only using the columns in select. Column types will be converted to non-missing types. For example:

Vector{Union{Int, Missing}} -> Vector{Int}
DataValueArray{Int} -> Vector{Int}

Example

t = table([0.1,0.5,missing,0.7], [2,missing,4,5], [missing,6,missing,7], names=[:t,:x,:y])
dropmissing(t)
dropmissing(t, (:t, :x))

IndexedTables.excludecols — Method

excludecols(itr, cols) -> Tuple of Int

Names of all columns in itr except cols. itr can be any of IndexedTable, NDSparse, StructArrays.StructVector, or AbstractVector

Examples

using IndexedTables: excludecols

t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2))

excludecols(t, (:x,))
excludecols(t, (2,))
excludecols(t, pkeynames(t))
excludecols([1,2,3], (1,))

IndexedTables.flatten — Function

flatten(t::Table, col=length(columns(t)))

Flatten col column which may contain a vector of iterables while repeating the other fields. If column argument is not provided, default to last column.

Examples:

x = table([1,2], [[3,4], [5,6]], names=[:x, :y])
flatten(x, 2)

t1 = table([3,4],[5,6], names=[:a,:b])
t2 = table([7,8], [9,10], names=[:a,:b])
x = table([1,2], [t1, t2], names=[:x, :y]);
flatten(x, :y)

IndexedTables.flush! — Method

flush!(arr::NDSparse)

Commit queued assignment operations, by sorting and merging the internal temporary buffer.

IndexedTables.groupby — Function

groupby(f, t, by = pkeynames(t); select, flatten=false, usekey = false)

Apply f to the select-ed columns (see select) in groups defined by the unique values of by.

If f returns a vector, split it into multiple columns with flatten = true.

To retain the grouping key in the resulting group use usekey = true.

Examples

using Statistics

t=table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6], names=[:x,:y,:z])

groupby(mean, t, :x, select=:z)
groupby(identity, t, (:x, :y), select=:z)
groupby(mean, t, (:x, :y), select=:z)

groupby((mean, std, var), t, :y, select=:z)
groupby((q25=z->quantile(z, 0.25), q50=median, q75=z->quantile(z, 0.75)), t, :y, select=:z)

# apply different aggregation functions to different columns
groupby((ymean = :y => mean, zmean = :z => mean), t, :x)

# include the grouping key
groupby(t, by; usekey = true) do key, dd
    # code using key as key (named tuple) and dd as data
end

IndexedTables.groupjoin — Method

groupjoin(left, right; kw...)
groupjoin(f, left, right; kw...)

Join left and right creating groups of values with matching keys.

For keyword argument options, see join.

Examples

l = table([1,1,1,2], [1,2,2,1], [1,2,3,4], names=[:a,:b,:c], pkey=(:a, :b))
r = table([0,1,1,2], [1,2,2,1], [1,2,3,4], names=[:a,:b,:d], pkey=(:a, :b))

groupjoin(l, r)
groupjoin(l, r; how = :left)
groupjoin(l, r; how = :outer)
groupjoin(l, r; how = :anti)

IndexedTables.groupreduce — Function

groupreduce(f, t, by = pkeynames(t); select)

Calculate a reduce operation f over table t on groups defined by the values in selection by. The result is put in a table keyed by the unique by values.

Examples

t = table([1,1,1,2,2,2], 1:6, names=[:x, :y])
groupreduce(+,        t, :x; select = :y)
groupreduce((sum=+,), t, :x; select = :y)  # change output column name to :sum

t2 = table([1,1,1,2,2,2], [1,1,2,2,3,3], 1:6, names = [:x, :y, :z])
groupreduce(+, t2, (:x, :y), select = :z)

# different reducers for different columns
groupreduce((sumy = :y => +, sumz = :z => +), t2, :x)

IndexedTables.insertcols — Method

insertcols(t, position::Integer, map::Pair...)

For each pair name => col in map, insert a column col named name starting at position. Returns a new table.

Example

t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcol(t, 2, :w => [0,1])

IndexedTables.insertcolsafter — Method

insertcolsafter(t, after, map::Pair...)

For each pair name => col in map, insert a column col named name after after. Returns a new table.

Example

t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcolsafter(t, :t, :w => [0,1])

IndexedTables.insertcolsbefore — Method

insertcolsbefore(t, before, map::Pair...)

For each pair name => col in map, insert a column col named name before before. Returns a new table.

Example

t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcolsbefore(t, :x, :w => [0,1])

IndexedTables.map_rows — Method

map_rows(f, c...)

Transform collection c by applying f to each element. For multiple collection arguments, apply f elementwise. Collect output as Columns if f returns Tuples or NamedTuples with constant fields, as Array otherwise.

Examples

map_rows(i -> (exp = exp(i), log = log(i)), 1:5)

IndexedTables.ncols — Function

ncols(itr)

Returns the number of columns in itr.

Examples

ncols([1,2,3]) == 1
ncols(rows(([1,2,3],[4,5,6]))) == 2

IndexedTables.ndsparse — Function

ndsparse(keys, values; kw...)

Construct an NDSparse array with the given keys and values columns. On construction, the keys and data are sorted in lexicographic order of the keys.

Keyword Argument Options:

agg = nothing – Function to aggregate values with duplicate keys.
presorted = false – Are the key columns already sorted?
copy = true – Should the columns in keys and values be copied?
chunks = nothing – Provide an integer to distribute data into chunks chunks.
- A good choice is nworkers() (after using Distributed)
- See also: distribute

Examples:

x = ndsparse(["a","b"], [3,4])
keys(x)
values(x)
x["a"]

# Dimensions are named if constructed with a named tuple of columns
x = ndsparse((index = 1:10,), rand(10))
x[1]

# Multiple dimensions by passing a (named) tuple of columns
x = ndsparse((x = 1:10, y = 1:2:20), rand(10))
x[1, 1]

# Value columns can also have names via named tuples
x = ndsparse(1:10, (x=rand(10), y=rand(10)))

IndexedTables.pkeynames — Method

pkeynames(t::Table)

Names of the primary key columns in t.

Examples

t = table([1,2], [3,4]);
pkeynames(t)

t = table([1,2], [3,4], pkey=1);
pkeynames(t)

t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2));
pkeynames(t)

IndexedTables.pkeynames — Method

pkeynames(t::NDSparse)

Names of the primary key columns in t.

Example

x = ndsparse([1,2],[3,4])
pkeynames(x)

x = ndsparse((x=1:10, y=1:2:20), rand(10))
pkeynames(x)

IndexedTables.pkeys — Method

pkeys(itr::IndexedTable)

Primary keys of the table. If Table doesn't have any designated primary key columns (constructed without pkey argument) then a default key of tuples (1,):(n,) is generated.

Example

a = table(["a","b"], [3,4]) # no pkey
pkeys(a)

a = table(["a","b"], [3,4], pkey=1)
pkeys(a)

IndexedTables.reducedim_vec — Method

reducedim_vec(f::Function, arr::NDSparse, dims)

Like reduce, except uses a function mapping a vector of values to a scalar instead of a 2-argument scalar function.

IndexedTables.reindex — Function

reindex(t::IndexedTable, by)
reindex(t::IndexedTable, by, select)

Reindex table t with new primary key by, optionally keeping a subset of columns via select. For NDSparse, use selectkeys.

Example

t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2))

t2 = reindex(t, (:y, :z))

pkeynames(t2)

IndexedTables.rename — Method

rename(t, map::Pair...)

For each pair col => newname in map, set newname as the new name for column col in t. Returns a new table.

Example

t = table([0.01, 0.05], [2,1], names=[:t, :x])
rename(t, :t => :time)

IndexedTables.rows — Function

rows(itr, select = All())

Select one or more fields from an iterable of rows as a vector of their values. Refer to the select function for selection options and syntax.

itr can be NDSparse, StructArrays.StructVector, AbstractVector, or their distributed counterparts.

Examples

t = table([1,2],[3,4], names=[:x,:y])
rows(t)
rows(t, :x)
rows(t, (:x,))
rows(t, (:y, :x => -))

IndexedTables.select — Method

select(t::Table, which::Selection)

Select all or a subset of columns, or a single column from the table.

Selection is a type union of many types that can select from a table. It can be:

Integer – returns the column at this position.
Symbol – returns the column with this name.
Pair{Selection => Function} – selects and maps a function over the selection, returns the result.
AbstractArray – returns the array itself. This must be the same length as the table.
Tuple of Selection – returns a table containing a column for every selector in the tuple.
Regex – returns the columns with names that match the regular expression.
Type – returns columns with elements of the given type.

Examples:

t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z])

# select the :x vector
select(t, 1)
select(t, :x)

# map a function to the :y vector
select(t, 2 => abs)
select(t, :y => x -> x > 0 ? x : -x)

# select the table of :x and :z
select(t, (:x, :z))
select(t, r"(x|z)")

# map a function to the table of :x and :y
select(t, (:x, :y) => row -> row[1] + row[2])
select(t, (1, :y) => row -> row.x + row.y)

IndexedTables.selectkeys — Method

selectkeys(x::NDSparse, sel)

Return an NDSparse with a subset of keys.

IndexedTables.selectvalues — Method

selectvalues(x::NDSparse, sel)

Return an NDSparse with a subset of values

IndexedTables.stack — Method

stack(t, by = pkeynames(t); select = Not(by), variable = :variable, value = :value)`

Reshape a table from the wide to the long format. Columns in by are kept as indexing columns. Columns in select are stacked. In addition to the id columns, two additional columns labeled variable and value are added, containing the column identifier and the stacked columns. See also unstack.

Examples

t = table(1:4, names = [:x], pkey=:x)
t = pushcol(t, :xsquare, :x => x -> x^2)
t = pushcol(t, :xcube  , :x => x -> x^3)

stack(t)

IndexedTables.summarize — Function

summarize(f, t, by = pkeynames(t); select = Not(by), stack = false, variable = :variable)

Apply summary functions column-wise to a table. Return a NamedTuple in the non-grouped case and a table in the grouped case. Use stack=true to stack results of the same summary function for different columns.

Examples

using Statistics

t = table([1, 2, 3], [1, 1, 1], names = [:x, :y])

summarize((mean, std), t)
summarize((m = mean, s = std), t)
summarize(mean, t; stack=true)
summarize((mean, std), t; select = :y)

IndexedTables.table — Function

table(cols; kw...)

Create a table from a (named) tuple of AbstractVectors.

table(cols::AbstractVector...; names::Vector{Symbol}, kw...)

Create a table from the provided cols, optionally with names.

table(cols::Columns; kw...)

Construct a table from a vector of tuples. See rows and Columns.

table(t::Union{IndexedTable, NDSparse}; kw...)

Copy a Table or NDSparse to create a new table. The same primary keys as the input are used.

table(x; kw...)

Create an IndexedTable from any object x that follows the Tables.jl interface.

Keyword Argument Options:

pkey: select columns to sort by and be the primary key.
presorted = false: is the data pre-sorted by primary key columns?
copy = true: creates a copy of the input vectors if true. Irrelevant if chunks is specified.
chunks::Integer: distribute the table. Options are:
- Int – (number of chunks) a safe bet is nworkers() after using Distributed.
- Vector{Int} – Number of elements in each of the length(chunks) chunks.

Examples:

table(rand(10), rand(10), names = [:x, :y], pkey = :x)

table(rand(Bool, 20), rand(20), rand(20), pkey = [1,2])

table((x = 1:10, y = randn(10)))

table([(1,2), (3,4)])

IndexedTables.unstack — Method

unstack(t, by = pkeynames(t); variable = :variable, value = :value)

Reshape a table from the long to the wide format. Columns in by are kept as indexing columns. Keyword arguments variable and value denote which column contains the column identifier and which the corresponding values. See also stack.

Examples

t = table(1:4, [1, 4, 9, 16], [1, 8, 27, 64], names = [:x, :xsquare, :xcube], pkey = :x);

long = stack(t)

unstack(long)

IndexedTables.update! — Method

update!(f::Function, arr::NDSparse, indices...)

Replace data values x with f(x) at each location that matches the given indices.

IndexedTables.where — Method

where(arr::NDSparse, indices...)

Returns an iterator over data items where the given indices match. Accepts the same index arguments as getindex.

StatsBase.transform — Method

transform(t::Table, changes::Pair...)

Transform columns of t. For each pair col => value in changes the column col is replaced by the AbstractVector value. If col is not an existing column, a new column is created.

Examples:

t = table([1,2], [3,4], names=[:x, :y])

# change second column to [5,6]
transform(t, 2 => [5,6])
transform(t, :y => :y => x -> x + 2)

# add [5,6] as column :z
transform(t, :z => 5:6)
transform(t, :z => :y => x -> x + 2)

# replacing the primary key results in a re-sorted copy
t = table([0.01, 0.05], [1,2], [3,4], names=[:t, :x, :y], pkey=:t)
t2 = transform(t, :t => [0.1,0.05])

# the column :z is not part of t so a new column is added
t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
transform(t, :z => [1//2, 3//4])