API

JuliaDB.JuliaDBModule
JuliaDB

JuliaDB is a package for working with large persistent data sets.

JuliaDB is an all-Julia, end-to-end tool that can

  • Load multi-dimensional datasets quickly and incrementally
  • Index the data and perform filter, aggregate, sort, and join operations
  • Save results and load them efficiently later
  • Use Julia's built-in parallelism to fully utilize any machine or cluster

Introduce yourself to JuliaDB's features at juliadb.org or jump into the documentation here!

source
JuliaDB.IndexSpaceType
IndexSpace(interval, boundingrect, nrows)

Metadata about an chunk.

  • interval: An Interval object with the first and the last index tuples.
  • boundingrect: An Interval object with the lowest and the highest indices as tuples.
  • nrows: A Nullable{Int} of number of rows in the NDSparse, if knowable.
source
JuliaDB.IntervalType

An interval type tailored specifically to store intervals of indices of an NDSparse object. Some of the operations on this like in or < may be controversial for a generic Interval type.

source
Base.collectMethod
collect(t::DNDSparse)

Gets distributed data in a DNDSparse t and merges it into NDSparse object

Warning

collect(t) requires at least as much memory as the size of the result of the computing t. If the result is expected to be big, try compute(save(t, "output_dir")) instead. See save for more. This data can be loaded later using load.

source
Base.getindexMethod
t[idx...]

Returns a DNDSparse containing only the elements of t where the given indices (idx) match. If idx has the same type as the index tuple of the t, then this is considered a scalar indexing (indexing of a single value). In this case the value itself is looked up and returned.

source
Base.lengthMethod

The length of the DNDSparse if it can be computed. Will throw an error if not. You can get the length of such tables after calling compute on them.

source
Base.mapMethod
map(f, t::DNDSparse)

Applies a function f on every element in the data of table t.

source
Dagger.computeMethod
compute(t::DNDSparse; allowoverlap, closed)

Computes any delayed-evaluations in the DNDSparse. The computed data is left on the worker processes. Subsequent operations on the results will reuse the chunks.

If allowoverlap is false then the computed data is re-sorted if required to have no chunks with overlapping index ranges if necessary.

If closed is true then the computed data is re-sorted if required to have no chunks with overlapping OR continuous boundaries.

See also collect.

Warning

compute(t) requires at least as much memory as the size of the result of the computing t. You usually don't need to do this for the whole dataset. If the result is expected to be big, try compute(save(t, "output_dir")) instead. See save for more.

source
Dagger.distributeFunction
distribute(itable::NDSparse, nchunks::Int=nworkers())

Distributes an NDSparse object into a DNDSparse of nchunks chunks of approximately equal size.

Returns a DNDSparse.

source
Dagger.distributeMethod
distribute(t::Table, chunks)

Distribute a table in chunks pieces. Equivalent to table(t, chunks=chunks).

source
Dagger.distributeMethod
distribute(itable::NDSparse, rowgroups::AbstractArray)

Distributes an NDSparse object into a DNDSparse by splitting it up into chunks of rowgroups elements. rowgroups is a vector specifying the number of rows in the chunks.

Returns a DNDSparse.

source
Dagger.loadMethod

load(dir::AbstractString)

Load a saved DNDSparse from dir directory. Data can be saved using the save function.

source
Dagger.saveMethod
save(t::Union{NDSparse, IndexedTable}, dest::AbstractString)

Save a dataset to disk as dest. Saved data can be loaded with load.

source
Dagger.saveMethod
save(t::Union{DNDSparse, DIndexedTable}, destdir::AbstractString)

Saves a distributed dataset to disk in directory destdir. Saved data can be loaded with load.

source
IndexedTables.convertdimMethod
convertdim(x::DNDSparse, d::DimName, xlate; agg::Function, name)

Apply function or dictionary xlate to each index in the specified dimension. If the mapping is many-to-one, agg is used to aggregate the results. name optionally specifies a name for the new dimension. xlate must be a monotonically increasing function.

See also reduce

source
IndexedTables.leftjoinMethod
leftjoin(left::DNDSparse, right::DNDSparse, [op::Function])

Keeps only rows with indices in left. If rows of the same index are present in right, then they are combined using op. op by default picks the value from right.

source
IndexedTables.naturaljoinMethod
naturaljoin(op, left::DNDSparse, right::DNDSparse, ascolumns=false)

Returns a new DNDSparse containing only rows where the indices are present both in left AND right tables. The data columns are concatenated. The data of the matching rows from left and right are combined using op. If op returns a tuple or NamedTuple, and ascolumns is set to true, the output table will contain the tuple elements as separate data columns instead as a single column of resultant tuples.

source
IndexedTables.naturaljoinMethod
naturaljoin(left::DNDSparse, right::DNDSparse, [op])

Returns a new DNDSparse containing only rows where the indices are present both in left AND right tables. The data columns are concatenated.

source
IndexedTables.reducedim_vecMethod
reducedim_vec(f::Function, t::DNDSparse, dims)

Like reducedim, except uses a function mapping a vector of values to a scalar instead of a 2-argument scalar function.

See also reducedim.

source
JuliaDB.fromchunksMethod
fromchunks(cs)

Construct a distributed object from chunks. Calls fromchunks(T, cs) where T is the type of the data in the first chunk. Computes any thunks.

source
JuliaDB.loadndsparseMethod

loadndsparse(files::Union{AbstractVector,String}; <options>)

Load an NDSparse from CSV files.

files is either a vector of file paths, or a directory name.

Options:

  • indexcols::Vector – columns to use as indexed columns. (by default a 1:n implicit index is used.)
  • datacols::Vector – non-indexed columns. (defaults to all columns but indexed columns). Specify this to only load a subset of columns. In place of the name of a column, you can specify a tuple of names – this will treat any column with one of those names as the same column, but use the first name in the tuple. This is useful when the same column changes name between CSV files. (e.g. vendor_id and VendorId)

All other options are identical to those in loadtable

source
JuliaDB.loadtableMethod

loadtable(files::Union{AbstractVector,String}; <options>)

Load a table from CSV files.

files is either a vector of file paths, or a directory name.

Options:

  • output::AbstractString – directory name to write the table to. By default data is loaded directly to memory. Specifying this option will allow you to load data larger than the available memory.

  • indexcols::Vector – columns to use as primary key columns. (defaults to [])

  • datacols::Vector – non-indexed columns. (defaults to all columns but indexed columns). Specify this to only load a subset of columns. In place of the name of a column, you can specify a tuple of names – this will treat any column with one of those names as the same column, but use the first name in the tuple. This is useful when the same column changes name between CSV files. (e.g. vendor_id and VendorId)

  • distributed::Bool – should the output dataset be loaded as a distributed table? If true, this will use all available worker processes to load the data. (defaults to true if workers are available, false if not)

  • chunks::Int – number of chunks to create when loading distributed. (defaults to number of workers)

  • delim::Char – the delimiter character. (defaults to ,). Use spacedelim=true to split by spaces.

  • spacedelim::Bool: parse space-delimited files. delim has no effect if true.

  • quotechar::Char – quote character. (defaults to ")

  • escapechar::Char – escape character. (defaults to ")

  • filenamecol::Union{Symbol, Pair} – create a column containing the file names from where each row came from. This argument gives a name to the column. By default, basename(name) of the name is kept, and ".csv" suffix will be stripped. To provide a custom function to apply on the names, use a name => Function pair. By default, no file name column will be created.

  • header_exists::Bool – does header exist in the files? (defaults to true)

  • colnames::Vector{String} – specify column names for the files, use this with (header_exists=false, otherwise first row is discarded). By default column names are assumed to be present in the file.

  • samecols – a vector of tuples of strings where each tuple contains alternative names for the same column. For example, if some files have the name "vendorid" and others have the name "VendorID", pass `samecols=[("VendorID", "vendorid")]`.

  • colparsers – either a vector or dictionary of data types or an AbstractToken object from TextParse package. By default, these are inferred automatically. See type_detect_rows option below.

  • type_detect_rows: number of rows to use to infer the initial colparsers defaults to 20.

  • nastrings::Vector{String} – strings that are to be considered missing values. (defaults to TextParse.NA_STRINGS)

  • skiplines_begin::Char – skip some lines in the beginning of each file. (doesn't skip by default)

  • usecache::Bool: (vestigial)

source
JuliaDB.mapchunksMethod
mapchunks(f, t::DNDSparse; keeplengths=true)

Applies a function to each chunk in t. Returns a new DNDSparse. If keeplength is false, this means that the lengths of the output chunks is unknown before compute. This function is used internally by many DNDSparse operations.

source
JuliaDB.partitionplotFunction
partitionplot(table, y;    stat=Extrema(), nparts=100, by=nothing, dropmissing=false)
partitionplot(table, x, y; stat=Extrema(), nparts=100, by=nothing, dropmissing=false)

Plot a summary of variable y against x (1:length(y) if not specified). Using nparts approximately-equal sections along the x-axis, the data in y over each section is summarized by stat.

source
JuliaDB.rechunkFunction

rechunk(t::Union{DNDSparse, DNDSparse}[, by[, select]]; <options>)

Reindex and sort a distributed dataset by keys selected by by.

Optionally select specifies which non-indexed fields are kept. By default this is all fields not mentioned in by for Table and the value columns for NDSparse.

Options:

  • chunks – how to distribute the data. This can be:
    1. An integer – number of chunks to create
    2. An vector of k integers – number of elements in each of the k chunks. sum(k) must be same as length(t)
    3. The distribution of another array. i.e. vec.subdomains where vec is a distributed array.
  • merge::Function – a function which merges two sub-table or sub-ndsparse into one NDSparse. They may have overlaps in their indices.
  • splitters::AbstractVector – specify keys to split by. To create n chunks you would need to pass n-1 splitters and also the chunks=n option.
  • chunks_sorted::Bool – are the chunks sorted locally? If true, this skips sorting or re-indexing them.
  • affinities::Vector{<:Integer} – which processes (Int pid) should each output chunk be created on. If unspecified all workers are used.
  • closed::Bool – if true, the same key will not be present in multiple chunks (although sorted). true by default.
  • nsamples::Integer – number of keys to randomly sample from each chunk to estimate splitters in the sorting process. (See samplesort). Defaults to 2000.
  • batchsize::Integer – how many chunks at a time from the input should be loaded into memory at any given time. This will essentially sort in batches of batchsize chunks.
source
JuliaDB.tracktimeMethod

tracktime(f)

Track the time spent on different processes in different categories in running f.

source
IndexedTables.ColDictMethod
d = ColDict(t)

Create a mutable dictionary of columns in t.

To get the immutable iterator of the same type as t call d[]

IndexedTables.KeysType
Keys()

Select the primary keys.

Examples

t = table([1,1,2,2], [1,2,1,2], [1,2,3,4], names=[:a,:b,:c], pkey = (:a, :b))
select(t, Keys())
IndexedTables.NDSparseMethod

NDSparse(columns...; names=Symbol[...], kwargs...)

Construct an NDSparse array from columns. The last argument is the data column, and the rest are index columns. The names keyword argument optionally specifies names for the index columns (dimensions).

IndexedTables.PermType

A permutation

Fields:

  • columns: The columns being indexed as a vector of integers (column numbers)
  • perm: the permutation - an array or iterator which has the sorted permutation
Base.Broadcast.broadcastMethod
broadcast(f, A::NDSparse, B::NDSparse; dimmap::Tuple{Vararg{Int}})
A .* B

Compute an inner join of A and B using function f, where the dimensions of B are a subset of the dimensions of A. Values from B are repeated over the extra dimensions.

dimmap optionally specifies how dimensions of A correspond to dimensions of B. It is a tuple where dimmap[i]==j means the ith dimension of A matches the jth dimension of B. Extra dimensions that do not match any dimensions of j should have dimmap[i]==0.

If dimmap is not specified, it is determined automatically using index column names and types.

Example

a = ndsparse(([1,1,2,2], [1,2,1,2]), [1,2,3,4])
b = ndsparse([1,2], [1/1, 1/2])
broadcast(*, a, b)

dimmap maps dimensions that should be broadcasted:

broadcast(*, a, b, dimmap=(0,1))
Base.convertMethod
convert(IndexedTable, pkeys, vals; kwargs...)

Construct a table with pkeys as primary keys and vals as corresponding non-indexed items. keyword arguments will be forwarded to table constructor.

Example

convert(IndexedTable, Columns(x=[1,2],y=[3,4]), Columns(z=[1,2]), presorted=true)
Base.filterMethod
filter(f, t::Union{IndexedTable, NDSparse}; select)

Iterate over t and Return the rows for which f(row) returns true. select determines the rows that are given as arguments to f (see select).

f can also be a tuple of column => function pairs. Returned rows will be those for which all conditions are true.

Example

# filter iterates over ROWS of a IndexedTable
t = table(rand(100), rand(100), rand(100), names = [:x, :y, :z])
filter(r -> r.x + r.y + r.z < 1, t)

# filter iterates over VALUES of an NDSparse
x = ndsparse(1:100, randn(100))
filter(val -> val > 0, x)
Base.joinMethod
join(left, right; kw...)
join(f, left, right; kw...)

Join tables left and right.

If a function f(leftrow, rightrow) is provided, the returned table will have a single output column. See the Examples below.

If the same key occurs multiple times in either table, each left row will get matched with each right row, resulting in n_occurrences_left * n_occurrences_right output rows.

Options (keyword arguments)

  • how = :inner
    • Join method to use. Described below.
  • lkey = pkeys(left)
    • Fields from left to match on (see pkeys).
  • rkey = pkeys(right)
    • Fields from right to match on.
  • lselect = Not(lkey)
    • Output columns from left (see Not)
  • rselect = Not(rkey)
    • Output columns from right.
  • missingtype = Missing
    • Type of missing values that can be created through :left and :outer joins.
    • Other supported option is DataValue.

Join methods (how = :inner)

  • :inner – rows with matching keys in both tables
  • :left – all rows from left, plus matched rows from right (missing values can occur)
  • :outer – all rows from both tables (missing values can occur)
  • :anti – rows in left WITHOUT matching keys in right

Examples

a = table((x = 1:10,   y = rand(10)), pkey = :x)
b = table((x = 1:2:20, z = rand(10)), pkey = :x)

join(a, b; how = :inner)
join(a, b; how = :left)
join(a, b; how = :outer)
join(a, b; how = :anti)

join((l, r) -> l.y + r.z, a, b)
Base.keysMethod

keys(x::NDSparse[, select::Selection])

Get the keys of an NDSparse object. Same as rows but acts only on the index columns of the NDSparse.

Base.mapMethod
map(f, t::IndexedTable; select)

Apply f to every item in t selected by select (see also the select function). Returns a new table if f returns a tuple or named tuple. If not, returns a vector.

Examples

t = table([1,2], [3,4], names=[:x, :y])

polar = map(p -> (r = hypot(p.x, p.y), θ = atan(p.y, p.x)), t)

back2t = map(p -> (x = p.r * cos(p.θ), y = p.r * sin(p.θ)), polar)
Base.mapMethod
map(f, x::NDSparse; select = values(x))

Apply f to every value of select selected from x (see select).

Apply f to every data value in x. select selects fields passed to f. By default, the data values are selected.

If the return value of f is a tuple or named tuple the result will contain many data columns.

Examples

x = ndsparse((t=[0.01, 0.05],), (x=[1,2], y=[3,4]))

polar = map(row -> (r = hypot(row.x, row.y), θ = atan(row.y, row.x)), x)

back2x = map(row -> (x = row.r * cos(row.θ), y = row.r * sin(row.θ)), polar)
Base.mergeMethod
merge(a::IndexedTable, b::IndexedTable; pkey)

Merge rows of a with rows of b and remain ordered by the primary key(s). a and b must have the same column names.

merge(a::NDSparse, a::NDSparse; agg)

Merge rows of a with rows of b. To keep unique keys, the value from b takes priority. A provided function agg will aggregate values from a and b that have the same key(s).

Example:

a = table((x = 1:5, y = rand(5)); pkey = :x)
b = table((x = 6:10, y = rand(5)); pkey = :x)
merge(a, b)

a = ndsparse([1,3,5], [1,2,3])
b = ndsparse([2,3,4], [4,5,6])
merge(a, b)
merge(a, b; agg = (x,y) -> x)
Base.pairsMethod

pairs(arr::NDSparse, indices...)

Similar to where, but returns an iterator giving index=>value pairs. index will be a tuple.

Base.reduceMethod
reduce(f, t::IndexedTable; select::Selection)

Apply reducer function f pair-wise to the selection select in t. The reducer f can be:

  1. A function
  2. An OnlineStat
  3. A (named) tuple of functions and/or OnlineStats
  4. A (named) tuple of (selector => function) or (selector => OnlineStat) pairs

Examples

t = table(1:5, 6:10, names = [:t, :x])

reduce(+, t, select = :t)
reduce((a, b) -> (t = a.t + b.t, x = a.x + b.x), t)

using OnlineStats
reduce(Mean(), t, select = :t)
reduce((Mean(), Variance()), t, select = :t)

y = reduce((min, max), t, select=:x)
reduce((sum = +, prod = *), t, select=:x)

# combining reduction and selection
reduce((xsum = :x => +, negtsum = (:t => -) => +), t)
Base.reduceMethod
reduce(f, x::NDSparse; dims)

Drop the dims dimension(s) and aggregate values with f.

x = ndsparse((x=[1,1,1,2,2,2],
              y=[1,2,2,1,2,2],
              z=[1,1,2,1,1,2]), [1,2,3,4,5,6])

reduce(+, x; dims=1)
reduce(+, x; dims=(1,3))
Base.sort!Method
sort!(t    ; kw...)
sort!(t, by; kw...)

Sort rows of t by by in place. All of Base.sort keyword arguments can be used.

Examples

t = table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6], names=[:x,:y,:z]);
sort!(t, :z, rev = true)
t
Base.sortMethod
sort(t    ; select, kw...)
sort(t, by; select, kw...)

Sort rows by by. All of Base.sort keyword arguments can be used.

Examples

t=table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6],
sort(t, :z; select = (:y, :z), rev = true)
Base.valuesMethod

values(x::NDSparse[, select::Selection])

Get the values of an NDSparse object. Same as rows but acts only on the value columns of the NDSparse.

IndexedTables.aggregate!Method
aggregate!(f::Function, arr::NDSparse)

Combine adjacent rows with equal indices using the given 2-argument reduction function, in place.

IndexedTables.arrayofMethod
arrayof(T)

Returns the type of Columns or Vector suitable to store values of type T. Nested tuples beget nested Columns.

IndexedTables.asofjoinMethod
asofjoin(left::NDSparse, right::NDSparse)

Join rows from left with the "most recent" value from right.

Example

using Dates
akey1 = ["A", "A", "B", "B"]
akey2 = [Date(2017,11,11), Date(2017,11,12), Date(2017,11,11), Date(2017,11,12)]
avals = collect(1:4)

bkey1 = ["A", "A", "B", "B"]
bkey2 = [Date(2017,11,12), Date(2017,11,13), Date(2017,11,10), Date(2017,11,13)]
bvals = collect(5:8)

a = ndsparse((akey1, akey2), avals)
b = ndsparse((bkey1, bkey2), bvals)

asofjoin(a, b)
IndexedTables.best_perm_estimateMethod

Returns: (n, perm) where n is the number of columns in the beginning of cols, perm is one possible permutation of those first n columns.

IndexedTables.collect_columnsMethod
collect_columns(itr)

Collect an iterable as a Columns object if it iterates Tuples or NamedTuples, as a normal Array otherwise.

Examples

s = [(1,2), (3,4)]
collect_columns(s)

s2 = Iterators.filter(isodd, 1:8)
collect_columns(s2)
IndexedTables.colnamesFunction
colnames(itr)

Returns the names of the "columns" in itr.

Examples:

colnames(1:3)
colnames(Columns([1,2,3], [3,4,5]))
colnames(table([1,2,3], [3,4,5]))
colnames(Columns(x=[1,2,3], y=[3,4,5]))
colnames(table([1,2,3], [3,4,5], names=[:x,:y]))
colnames(ndsparse(Columns(x=[1,2,3]), Columns(y=[3,4,5])))
colnames(ndsparse(Columns(x=[1,2,3]), [3,4,5]))
colnames(ndsparse(Columns(x=[1,2,3]), [3,4,5]))
colnames(ndsparse(Columns([1,2,3], [4,5,6]), Columns(x=[6,7,8])))
colnames(ndsparse(Columns(x=[1,2,3]), Columns([3,4,5],[6,7,8])))
IndexedTables.columnsFunction
columns(itr, select::Selection = All())

Select one or more columns from an iterable of rows as a tuple of vectors.

select specifies which columns to select. Refer to the select function for the available selection options and syntax.

itr can be NDSparse, Columns, AbstractVector, or their distributed counterparts.

Examples

t = table(1:2, 3:4; names = [:x, :y])

columns(t)
columns(t, :x)
columns(t, (:x,))
columns(t, (:y, :x => -))
IndexedTables.columnsMethod

columns(itr, which)

Returns a vector or a tuple of vectors from the iterator.

IndexedTables.convertdimMethod

convertdim(x::NDSparse, d::DimName, xlate; agg::Function, vecagg::Function, name)

Apply function or dictionary xlate to each index in the specified dimension. If the mapping is many-to-one, agg or vecagg is used to aggregate the results. If agg is passed, it is used as a 2-argument reduction function over the data. If vecagg is passed, it is used as a vector-to-scalar function to aggregate the data. name optionally specifies a new name for the translated dimension.

IndexedTables.convertmissingMethod
convertmissing(tbl, missingtype)

Convert the missing value representation in tbl to be of type missingtype.

Example

using IndexedTables, DataValues
t = table([1,2,missing], [1,missing,3])
IndexedTables.convertmissing(t, DataValue)
IndexedTables.dimlabelsMethod

dimlabels(t::NDSparse)

Returns an array of integers or symbols giving the labels for the dimensions of t. ndims(t) == length(dimlabels(t)).

IndexedTables.dropmissingFunction
dropmissing(t        )
dropmissing(t, select)

Drop rows of table t which contain missing values (either Missing or DataValue), optionally only using the columns in select. Column types will be converted to non-missing types. For example:

  • Vector{Union{Int, Missing}} -> Vector{Int}
  • DataValueArray{Int} -> Vector{Int}

Example

t = table([0.1,0.5,missing,0.7], [2,missing,4,5], [missing,6,missing,7], names=[:t,:x,:y])
dropmissing(t)
dropmissing(t, (:t, :x))
IndexedTables.excludecolsMethod
excludecols(itr, cols) -> Tuple of Int

Names of all columns in itr except cols. itr can be any of IndexedTable, NDSparse, StructArrays.StructVector, or AbstractVector

Examples

using IndexedTables: excludecols

t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2))

excludecols(t, (:x,))
excludecols(t, (2,))
excludecols(t, pkeynames(t))
excludecols([1,2,3], (1,))
IndexedTables.flattenFunction
flatten(t::Table, col=length(columns(t)))

Flatten col column which may contain a vector of iterables while repeating the other fields. If column argument is not provided, default to last column.

Examples:

x = table([1,2], [[3,4], [5,6]], names=[:x, :y])
flatten(x, 2)

t1 = table([3,4],[5,6], names=[:a,:b])
t2 = table([7,8], [9,10], names=[:a,:b])
x = table([1,2], [t1, t2], names=[:x, :y]);
flatten(x, :y)
IndexedTables.flush!Method

flush!(arr::NDSparse)

Commit queued assignment operations, by sorting and merging the internal temporary buffer.

IndexedTables.groupbyFunction
groupby(f, t, by = pkeynames(t); select, flatten=false, usekey = false)

Apply f to the select-ed columns (see select) in groups defined by the unique values of by.

If f returns a vector, split it into multiple columns with flatten = true.

To retain the grouping key in the resulting group use usekey = true.

Examples

using Statistics

t=table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6], names=[:x,:y,:z])

groupby(mean, t, :x, select=:z)
groupby(identity, t, (:x, :y), select=:z)
groupby(mean, t, (:x, :y), select=:z)

groupby((mean, std, var), t, :y, select=:z)
groupby((q25=z->quantile(z, 0.25), q50=median, q75=z->quantile(z, 0.75)), t, :y, select=:z)

# apply different aggregation functions to different columns
groupby((ymean = :y => mean, zmean = :z => mean), t, :x)

# include the grouping key
groupby(t, by; usekey = true) do key, dd
    # code using key as key (named tuple) and dd as data
end
IndexedTables.groupjoinMethod
groupjoin(left, right; kw...)
groupjoin(f, left, right; kw...)

Join left and right creating groups of values with matching keys.

For keyword argument options, see join.

Examples

l = table([1,1,1,2], [1,2,2,1], [1,2,3,4], names=[:a,:b,:c], pkey=(:a, :b))
r = table([0,1,1,2], [1,2,2,1], [1,2,3,4], names=[:a,:b,:d], pkey=(:a, :b))

groupjoin(l, r)
groupjoin(l, r; how = :left)
groupjoin(l, r; how = :outer)
groupjoin(l, r; how = :anti)
IndexedTables.groupreduceFunction
groupreduce(f, t, by = pkeynames(t); select)

Calculate a reduce operation f over table t on groups defined by the values in selection by. The result is put in a table keyed by the unique by values.

Examples

t = table([1,1,1,2,2,2], 1:6, names=[:x, :y])
groupreduce(+,        t, :x; select = :y)
groupreduce((sum=+,), t, :x; select = :y)  # change output column name to :sum

t2 = table([1,1,1,2,2,2], [1,1,2,2,3,3], 1:6, names = [:x, :y, :z])
groupreduce(+, t2, (:x, :y), select = :z)

# different reducers for different columns
groupreduce((sumy = :y => +, sumz = :z => +), t2, :x)
IndexedTables.insertcolsMethod
insertcols(t, position::Integer, map::Pair...)

For each pair name => col in map, insert a column col named name starting at position. Returns a new table.

Example

t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcol(t, 2, :w => [0,1])
IndexedTables.insertcolsafterMethod
insertcolsafter(t, after, map::Pair...)

For each pair name => col in map, insert a column col named name after after. Returns a new table.

Example

t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcolsafter(t, :t, :w => [0,1])
IndexedTables.insertcolsbeforeMethod

insertcolsbefore(t, before, map::Pair...)

For each pair name => col in map, insert a column col named name before before. Returns a new table.

Example

t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcolsbefore(t, :x, :w => [0,1])
IndexedTables.map_rowsMethod
map_rows(f, c...)

Transform collection c by applying f to each element. For multiple collection arguments, apply f elementwise. Collect output as Columns if f returns Tuples or NamedTuples with constant fields, as Array otherwise.

Examples

map_rows(i -> (exp = exp(i), log = log(i)), 1:5)
IndexedTables.ncolsFunction
ncols(itr)

Returns the number of columns in itr.

Examples

ncols([1,2,3]) == 1
ncols(rows(([1,2,3],[4,5,6]))) == 2
IndexedTables.ndsparseFunction
ndsparse(keys, values; kw...)

Construct an NDSparse array with the given keys and values columns. On construction, the keys and data are sorted in lexicographic order of the keys.

Keyword Argument Options:

  • agg = nothing – Function to aggregate values with duplicate keys.
  • presorted = false – Are the key columns already sorted?
  • copy = true – Should the columns in keys and values be copied?
  • chunks = nothing – Provide an integer to distribute data into chunks chunks.
    • A good choice is nworkers() (after using Distributed)
    • See also: distribute

Examples:

x = ndsparse(["a","b"], [3,4])
keys(x)
values(x)
x["a"]

# Dimensions are named if constructed with a named tuple of columns
x = ndsparse((index = 1:10,), rand(10))
x[1]

# Multiple dimensions by passing a (named) tuple of columns
x = ndsparse((x = 1:10, y = 1:2:20), rand(10))
x[1, 1]

# Value columns can also have names via named tuples
x = ndsparse(1:10, (x=rand(10), y=rand(10)))
IndexedTables.pkeynamesMethod
pkeynames(t::Table)

Names of the primary key columns in t.

Examples

t = table([1,2], [3,4]);
pkeynames(t)

t = table([1,2], [3,4], pkey=1);
pkeynames(t)

t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2));
pkeynames(t)
IndexedTables.pkeynamesMethod
pkeynames(t::NDSparse)

Names of the primary key columns in t.

Example

x = ndsparse([1,2],[3,4])
pkeynames(x)

x = ndsparse((x=1:10, y=1:2:20), rand(10))
pkeynames(x)
IndexedTables.pkeysMethod
pkeys(itr::IndexedTable)

Primary keys of the table. If Table doesn't have any designated primary key columns (constructed without pkey argument) then a default key of tuples (1,):(n,) is generated.

Example

a = table(["a","b"], [3,4]) # no pkey
pkeys(a)

a = table(["a","b"], [3,4], pkey=1)
pkeys(a)
IndexedTables.reducedim_vecMethod

reducedim_vec(f::Function, arr::NDSparse, dims)

Like reduce, except uses a function mapping a vector of values to a scalar instead of a 2-argument scalar function.

IndexedTables.reindexFunction
reindex(t::IndexedTable, by)
reindex(t::IndexedTable, by, select)

Reindex table t with new primary key by, optionally keeping a subset of columns via select. For NDSparse, use selectkeys.

Example

t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2))

t2 = reindex(t, (:y, :z))

pkeynames(t2)
IndexedTables.renameMethod
rename(t, map::Pair...)

For each pair col => newname in map, set newname as the new name for column col in t. Returns a new table.

Example

t = table([0.01, 0.05], [2,1], names=[:t, :x])
rename(t, :t => :time)
IndexedTables.rowsFunction
rows(itr, select = All())

Select one or more fields from an iterable of rows as a vector of their values. Refer to the select function for selection options and syntax.

itr can be NDSparse, StructArrays.StructVector, AbstractVector, or their distributed counterparts.

Examples

t = table([1,2],[3,4], names=[:x,:y])
rows(t)
rows(t, :x)
rows(t, (:x,))
rows(t, (:y, :x => -))
IndexedTables.selectMethod
select(t::Table, which::Selection)

Select all or a subset of columns, or a single column from the table.

Selection is a type union of many types that can select from a table. It can be:

  1. Integer – returns the column at this position.
  2. Symbol – returns the column with this name.
  3. Pair{Selection => Function} – selects and maps a function over the selection, returns the result.
  4. AbstractArray – returns the array itself. This must be the same length as the table.
  5. Tuple of Selection – returns a table containing a column for every selector in the tuple.
  6. Regex – returns the columns with names that match the regular expression.
  7. Type – returns columns with elements of the given type.

Examples:

t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z])

# select the :x vector
select(t, 1)
select(t, :x)

# map a function to the :y vector
select(t, 2 => abs)
select(t, :y => x -> x > 0 ? x : -x)

# select the table of :x and :z
select(t, (:x, :z))
select(t, r"(x|z)")

# map a function to the table of :x and :y
select(t, (:x, :y) => row -> row[1] + row[2])
select(t, (1, :y) => row -> row.x + row.y)
IndexedTables.stackMethod
stack(t, by = pkeynames(t); select = Not(by), variable = :variable, value = :value)`

Reshape a table from the wide to the long format. Columns in by are kept as indexing columns. Columns in select are stacked. In addition to the id columns, two additional columns labeled variable and value are added, containing the column identifier and the stacked columns. See also unstack.

Examples

t = table(1:4, names = [:x], pkey=:x)
t = pushcol(t, :xsquare, :x => x -> x^2)
t = pushcol(t, :xcube  , :x => x -> x^3)

stack(t)
IndexedTables.summarizeFunction
summarize(f, t, by = pkeynames(t); select = Not(by), stack = false, variable = :variable)

Apply summary functions column-wise to a table. Return a NamedTuple in the non-grouped case and a table in the grouped case. Use stack=true to stack results of the same summary function for different columns.

Examples

using Statistics

t = table([1, 2, 3], [1, 1, 1], names = [:x, :y])

summarize((mean, std), t)
summarize((m = mean, s = std), t)
summarize(mean, t; stack=true)
summarize((mean, std), t; select = :y)
IndexedTables.tableFunction
table(cols; kw...)

Create a table from a (named) tuple of AbstractVectors.

table(cols::AbstractVector...; names::Vector{Symbol}, kw...)

Create a table from the provided cols, optionally with names.

table(cols::Columns; kw...)

Construct a table from a vector of tuples. See rows and Columns.

table(t::Union{IndexedTable, NDSparse}; kw...)

Copy a Table or NDSparse to create a new table. The same primary keys as the input are used.

table(x; kw...)

Create an IndexedTable from any object x that follows the Tables.jl interface.

Keyword Argument Options:

  • pkey: select columns to sort by and be the primary key.
  • presorted = false: is the data pre-sorted by primary key columns?
  • copy = true: creates a copy of the input vectors if true. Irrelevant if chunks is specified.
  • chunks::Integer: distribute the table. Options are:
    • Int – (number of chunks) a safe bet is nworkers() after using Distributed.
    • Vector{Int} – Number of elements in each of the length(chunks) chunks.

Examples:

table(rand(10), rand(10), names = [:x, :y], pkey = :x)

table(rand(Bool, 20), rand(20), rand(20), pkey = [1,2])

table((x = 1:10, y = randn(10)))

table([(1,2), (3,4)])
IndexedTables.unstackMethod
unstack(t, by = pkeynames(t); variable = :variable, value = :value)

Reshape a table from the long to the wide format. Columns in by are kept as indexing columns. Keyword arguments variable and value denote which column contains the column identifier and which the corresponding values. See also stack.

Examples

t = table(1:4, [1, 4, 9, 16], [1, 8, 27, 64], names = [:x, :xsquare, :xcube], pkey = :x);

long = stack(t)

unstack(long)
IndexedTables.update!Method

update!(f::Function, arr::NDSparse, indices...)

Replace data values x with f(x) at each location that matches the given indices.

IndexedTables.whereMethod

where(arr::NDSparse, indices...)

Returns an iterator over data items where the given indices match. Accepts the same index arguments as getindex.

StatsBase.transformMethod
transform(t::Table, changes::Pair...)

Transform columns of t. For each pair col => value in changes the column col is replaced by the AbstractVector value. If col is not an existing column, a new column is created.

Examples:

t = table([1,2], [3,4], names=[:x, :y])

# change second column to [5,6]
transform(t, 2 => [5,6])
transform(t, :y => :y => x -> x + 2)

# add [5,6] as column :z
transform(t, :z => 5:6)
transform(t, :z => :y => x -> x + 2)

# replacing the primary key results in a re-sorted copy
t = table([0.01, 0.05], [1,2], [3,4], names=[:t, :x, :y], pkey=:t)
t2 = transform(t, :t => [0.1,0.05])

# the column :z is not part of t so a new column is added
t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
transform(t, :z => [1//2, 3//4])