---
title: "Get started with lagseq"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Get started with lagseq}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", message = FALSE,
                      warning = FALSE, dpi = 150, fig.width = 7,
                      fig.height = 5.6, out.width = "100%",
                      fig.align = "center")
set.seed(1)
library(lagseq)
options(digits = 3)
has <- function(p) requireNamespace(p, quietly = TRUE)
```

# The Method

Lag-sequential analysis (LSA) studies ordered categorical processes. It
asks whether, given that one state has just occurred, another state
follows more often or less often than chance. The unit of analysis is the
transition from a source state to a target state, written as `from -> to`.

The reference model is independence. Under independence, the current
state has no sequential memory: the next state is expected only from the
overall marginal distribution of states, not from the state that preceded
it. LSA compares each observed transition count with the count expected
under that no-memory model. The adjusted residual is the standardized
departure from that expectation. A positive residual means that the
transition is over-represented. A negative residual means that the
transition is avoided. Values near zero indicate that the transition is
close to what independence would predict.

The residual is therefore the test statistic for a local sequential
claim. A transition with a large residual is not merely common. It is a
transition whose observed frequency departs from the chance baseline
defined by the row and column margins of the transition table.

# Why lagseq

LSA has a long history in behavioral, educational, and interaction
research, but applied studies often stop at description. Observed
transition rates are reported, plotted, and interpreted, while the
inferential question is left implicit: whether the apparent pattern is
larger than expected under chance sequential pairing.

Existing tooling has also tended to be dated, object-centric, or
difficult to combine with modern data workflows. `lagseq` modernizes this
practice. It implements lag-sequential analysis as a tidy, visual,
interoperable, and confirmatory workflow. Each result is returned by a
verb as a one-row-per-observation data frame. The fitted model can be
read, plotted, resampled, compared across groups, and converted to
network objects without indexing into internal slots.

The design follows the Dynalytics framework of Saqr, Lopez-Pernas, and
Misiejuk (2026). Dynalytics treats analysis as a scientific contract:
every analytical claim must be matched by evidence appropriate to the
structure, scope, and complexity of the claim. In `lagseq`, an edge is a
tested departure from independence. Descriptive claims use descriptive
evidence, edge claims use edge-level uncertainty, whole-network claims
use reproducibility evidence, and group-difference claims use comparison
procedures that preserve the empirical unit of the sequence.

# Hands-On Tour

The main example is `group_regulation`, a bundled wide-format data set of
2,000 collaborative-learning sessions. Each row is one sequence, each
column is an ordered time point, and each cell is a coded regulation
action. The smaller `engagement` data set is also bundled and is used
below to show the same input grammar on weekly engagement trajectories.

```{r data}
dim(group_regulation)
head(group_regulation)

dim(engagement)
head(engagement)
```

## Fit

`lsa()` fits the model in one call. It tabulates the transition matrix,
derives expected counts under independence, computes adjusted residuals,
and records the recipe used to estimate the model. The printed object is
a compact summary. The fitted object itself is read through verbs.

```{r fit}
fit <- lsa(group_regulation)
fit
```

## Read a Fit

The public surface is tidy. Every result is produced by a verb, and each
verb returns a data frame with one row per observation. `transitions()`
returns one row per `from -> to` cell. The columns contain observed
counts, expected counts, transition probabilities, adjusted residuals,
p-values, and related association statistics.

```{r transitions}
transitions(fit) |> head()   # one row per from -> to transition
```

The companion verbs read the other parts of the same fitted model.
`nodes()` reports incoming and outgoing volume by state. `tests()`
reports tablewise independence tests. `initial()` reports the
initial-state distribution when the input contains sequences rather than
a precomputed transition matrix.

```{r nodes-tests-initial}
nodes(fit)                  # one row per state, with in/out totals
tests(fit)                  # tablewise independence tests (G^2, chi-square)
initial(fit)                # initial-state distribution
```

`summary()` returns the same tidy discipline at the model level: one row
per fitted model. For a grouped model, the same verb returns one row per
group.

```{r summary}
summary(fit)                # one-row overview
```

Focused transition tables are requested through arguments to
`transitions()`. The call states the scientific filter directly:
significant transitions, over-represented transitions, avoided
transitions, or transitions with a minimum observed count.

```{r filters}
transitions(fit, significant = TRUE) |> head()    # significant only
transitions(fit, direction = "over")  |> head(4)  # over-represented
transitions(fit, direction = "under") |> head(4)  # avoided
transitions(fit, min_count = 500)     |> head(4)  # frequently observed
```

## Visualise

One plotting verb covers the main views: `plot(fit, type = )`. The
default heatmap displays adjusted residuals, which makes it the direct
visual form of the LSA model. The residual network displays the same
departures as directed edges. A probability-weighted network switches the
representation to a TNA transition network. Chord and sunburst views
summarize transition flow and outgoing distributions. Uncertainty
forests and group barrels are produced by `plot()` on validation and
comparison results.

```{r plot-residuals, eval = requireNamespace("ggplot2", quietly = TRUE)}
plot(fit)                     # residual heatmap (default)
```

```{r plot-network, eval = requireNamespace("cograph", quietly = TRUE)}
plot(fit, type = "network")   # residual network
```

```{r plot-tna-network, eval = requireNamespace("tna", quietly = TRUE)}
plot(fit, type = "network", weights = "prob")   # TNA transition network
```

```{r plot-chord, eval = requireNamespace("cograph", quietly = TRUE)}
plot(fit, type = "chord")
```

```{r plot-sunburst, eval = requireNamespace("ggplot2", quietly = TRUE)}
plot(fit, type = "sunburst")
```

See `vignette("plotting", package = "lagseq")` for the full plotting
gallery.

## Get Data In

The canonical input is a sequence object: one or more categorical
sequences with missing cells dropped as missingness rather than modeled
as a state. Wide data frames and matrices use rows as sequences and
columns as ordered positions. This is the shape of both bundled data
sets.

```{r input}
lsa(engagement)
lsa(group_regulation)
```

The canonical sequence representation and the raw transition table can
also be inspected directly. These helpers are useful when the sequencing
contract needs to be made explicit before a model is fitted.

```{r data-helpers}
as.data.frame(lsa_data(group_regulation)) |> head(6)
as.data.frame(lsa_transitions(group_regulation)) |> head(6)
```

Raw event logs are handled with the same `actor` / `action` / `time`
grammar used by `tna`. The chunk below uses the real
`tna::group_regulation_long` log when `tna` is installed. The call
groups events by actor, orders them by time, constructs sequences, and
then fits the LSA model.

```{r long-log, eval = requireNamespace("tna", quietly = TRUE)}
log <- tna::group_regulation_long
head(log)

fit_log <- lsa(log, actor = "Actor", action = "Action", time = "Time")
fit_log
```

Interoperability also works at the object level. Sequence-bearing
objects from `tna`, Nestimate, and `TraMineR` can be passed to `lsa()`;
fitted LSA models can be converted back to a TNA object with
`lsa_to_tna()`.

```{r ingest-tna, eval = requireNamespace("tna", quietly = TRUE)}
log <- tna::group_regulation_long
prepared <- tna::prepare_data(log, actor = "Actor",
                              action = "Action", time = "Time")
lsa(tna::tna(prepared))
```

```{r ingest-traminer, eval = requireNamespace("TraMineR", quietly = TRUE)}
seq_obj <- TraMineR::seqdef(engagement)
lsa(seq_obj)
```

## Lags

The lag controls the temporal distance between the source and target
states. Positive lags count successors, negative lags count
predecessors, and lag zero pairs each event with itself. `lag_profile()`
tracks one transition across several lags. `lsa_lags()` fits a full
multi-lag profile and returns a tidy long table through
`as.data.frame()`.

```{r lags}
lag_profile(group_regulation, "plan", "consensus", lags = 1:3)
as.data.frame(lsa_lags(group_regulation, lags = 1:2)) |> head(3)
```

## Groups

A grouped model estimates one LSA fit per group under one shared label
set. For raw long logs, `group` names a grouping column that is constant
within each recovered sequence. The same reading verbs then return a
leading `group` column.

```{r groups, eval = requireNamespace("tna", quietly = TRUE)}
log <- tna::group_regulation_long
gfit <- lsa(log, actor = "Actor", action = "Action", time = "Time",
            group = "Achiever")
gfit
transitions(gfit, significant = TRUE) |> head(4)
nodes(gfit) |> head(4)
summary(gfit)
```

## Validate and Confirm

The adjusted residual tests whether an edge departs from independence in
the fitted table. Stronger claims require further evidence. `lagseq`
therefore includes a confirmatory battery that follows the Dynalytics
contract.

`certainty_lsa()` gives analytic Bayesian uncertainty for transition
probabilities. It uses a Dirichlet-Multinomial posterior for each source
state's outgoing transitions and returns credible intervals without
resampling.

```{r certainty}
cert <- certainty_lsa(fit)
cert
as.data.frame(cert) |> head(3)
```

`bootstrap_lsa()` is the resampling counterpart. It resamples whole
sequences, refits the model, and summarizes edge-level variability. A
forest plot displays the interval evidence for the selected edge metric.

```{r bootstrap}
boot <- bootstrap_lsa(fit, R = 50)
as.data.frame(boot) |> head(3)
```

```{r forest, eval = requireNamespace("ggplot2", quietly = TRUE)}
plot(boot)
```

`reliability_lsa()` estimates whole-network reproducibility by repeated
split-half refitting. `stability_lsa()` drops cases and records whether
edges remain significant. `permute_lsa()` constructs an empirical null
by shuffling events within sequences and recomputing the residuals.

```{r validation}
rel <- reliability_lsa(fit, R = 20)
rel
as.data.frame(rel) |> head(3)

stab <- stability_lsa(fit, R = 30)
stab
as.data.frame(stab) |> head(3)

pm <- permute_lsa(fit, R = 50)
pm
as.data.frame(pm) |> head(3)
```

Group comparison is confirmatory rather than visual. `compare_lsa()`
permutes group labels at the sequence level and tests whether transition
structures differ. `bayes_compare_lsa()` is the analytic Bayesian
comparison: it reports the plausible range of group differences in
transition probabilities.

```{r compare, eval = requireNamespace("tna", quietly = TRUE)}
log <- tna::group_regulation_long
gfit <- lsa(log, actor = "Actor", action = "Action", time = "Time",
            group = "Achiever")

cmp <- compare_lsa(gfit, R = 50, adjust = "BH")
cmp
as.data.frame(cmp) |> head(4)

bcmp <- bayes_compare_lsa(gfit, draws = 1000, adjust = "BH", seed = 1)
bcmp
as.data.frame(bcmp) |> head(4)
```

```{r barrel, eval = requireNamespace("tna", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE)}
plot(cmp)
plot(cmp, style = "heatmap")
```

## Engines

The classical LSA engine is the default, but the package also includes
two-cell, bidirectional, parallel-dominance, and nonparallel-dominance
engines. The registry is open, so custom engines can be registered with
`register_lsa_engine()` and removed with `unregister_lsa_engine()`.
Convenience wrappers call the built-in engines directly.

```{r engines}
list_lsa_engines()
lsa_two_cell(group_regulation)
lsa_bidirectional(group_regulation)
```

## Structural Zeros

Some transitions are impossible by design. `loops = FALSE` is the common
constraint when self-transitions are not part of the model. A full
0/1 matrix can be supplied through `structural_zeros` for arbitrary
forbidden patterns. Structural-zero cells are not treated as expected
transitions; they are non-estimable cells in the model.

```{r structural-zeros}
fz <- lsa(group_regulation, loops = FALSE)   # forbid self-transitions
transitions(fz) |> subset(from == to)
```

## Network Toolkits

An LSA fit is also a directed weighted network. `lsa_to_tna()` exposes it
to the TNA ecosystem. The choice of weight carries the model meaning:
`prob` is a transition network, `count` is observed transition volume,
`adj_res` is the residual LSA network, and `lift` is observed over
expected association strength.

```{r tna, eval = requireNamespace("tna", quietly = TRUE)}
tna::centralities(lsa_to_tna(fit, weights = "prob")) |> head()
```

# Ecosystem

`lagseq` sits beside several established approaches to process data.
Process mining discovers and checks event-log workflows. Social network
analysis models relations among actors. Sequence analysis models whole
trajectories and distances between them. Transition network analysis
models conditional movement from one state to the next.

LSA is closest to transition network analysis, but it answers a different
question. A TNA edge is a transition tendency, usually a conditional
probability. An LSA edge is a tested departure from independence. The
same empirical process can therefore produce both models, but the
scientific claims are different. TNA asks where the process tends to go
next. LSA asks which transitions are more or less frequent than expected
under a no-memory baseline.

Within the Dynalytics open-source ecosystem, `tna`, Nestimate, and
`cograph` provide related representations, estimation tools, and
visualization layers for dynamical systems. `lagseq` contributes the
lag-sequential and confirmatory member of that ecosystem. It keeps the
methodological contract explicit: each edge is tested against
independence, each uncertainty claim is matched to uncertainty evidence,
each whole-network claim is matched to reliability or stability
evidence, and each group claim is matched to a comparison procedure.

In short, the package turns a familiar descriptive practice into a
complete inferential workflow:

```r
fit <- lsa(group_regulation)                  # fit
transitions(fit)                              # tidy edge table
transitions(fit, significant = TRUE)          # tested departures
nodes(fit); tests(fit); initial(fit)          # other tidy results
summary(fit)                                  # one-row overview
plot(fit)                                     # residual heatmap
lag_profile(group_regulation, "plan", "consensus")  # multi-lag profile
certainty_lsa(fit); bootstrap_lsa(fit)        # edge uncertainty
reliability_lsa(fit); stability_lsa(fit)      # network evidence
permute_lsa(fit)                              # empirical null
```

See `vignette("workflow", package = "lagseq")` for a fuller
claim-to-evidence workflow and `?lsa` for the complete input grammar.