--- title: "A complete workflow: from sequences to a group comparison" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{A complete workflow: from sequences to a group comparison} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE, dpi = 150, fig.width = 7, fig.height = 5.6, out.width = "100%", fig.align = "center") set.seed(2026) library(lagseq) options(digits = 3) has <- function(p) requireNamespace(p, quietly = TRUE) ``` Lag-sequential analysis (LSA) asks whether, given a state at one moment, another state follows it more or less often than chance. The method has a long history in the behavioural sciences, but applied work has often stopped at description -- reading off observed transition rates -- without the inferential machinery needed to establish that a pattern reflects the process rather than the particular sample. `lagseq` implements LSA as a member of the Dynalytics framework (Saqr, Lopez-Pernas, and Misiejuk, 2026), whose central tenet is a *scientific contract*: every analytical claim must be matched by evidence appropriate to its structure, scope, and complexity, and estimation, validation, and interpretation are treated as one inferential system rather than separate steps. Accordingly, in `lagseq` each edge is a *tested departure from independence* -- the adjusted residual is a test statistic, not a descriptive weight. The package is built around a small set of design commitments: - **Tidy.** Every result is produced by a verb (`transitions()`, `nodes()`, `tests()`, `initial()`, ...) that returns a one-row-per-observation data frame; the user never indexes into an object. - **Interoperable with the tna family.** `lsa()` ingests sequence-bearing objects from `tna`, Nestimate, and `TraMineR` and shares their `actor` / `action` / `time` grammar, while a fitted model converts back to a `tna` network in one call. - **Rigorous, following Dynalytics.** The confirmatory testing battery is built in: analytic certainty and the bootstrap for edge-level uncertainty, split-half reliability for the whole network, case-dropping for stability, and permutation under exchangeability for group comparison. - **Expanded.** Beyond the classical engine the package offers several model variants, multi-lag analysis, structural-zero constraints, and an analytic Bayesian certainty estimator. - **Visual.** A single `plot()` verb renders the model as a residual heatmap, a residual network, a probability-weighted TNA transition network, a chord diagram, or a polar sunburst, alongside dedicated plots for uncertainty and for group differences. - **Group-aware.** A grouping column yields one model per group, and `compare_lsa()` tests whether the groups differ by permutation. The remainder of this vignette walks one analysis end to end on real bundled data, treating each step as one tier of the claim-to-evidence map: descriptive views support claims about which transitions exist; certainty intervals and bootstraps support claims about particular edges; split-half reliability supports claims about the reproducibility of the whole network; and permutation tests support claims that two groups differ. # The data: engagement trajectories `engagement` is 138 students observed over 15 weekly time points; each cell is that week's engagement state - `Active`, `Average`, or `Disengaged`. It is a wide matrix: **one sequence per row**. This section anchors the scope of every later claim: the units are students, the events are weekly engagement states, and the possible states are known before the model is fitted. In Dynalytics terms, this is the descriptive foundation on which the evidential contract rests. Before asking whether an edge is surprising or a group differs, we first make clear what process the model is allowed to represent. ```{r data} dim(engagement) head(engagement) ``` # Fit `lsa()` estimates the model in a single call. It tabulates every state-to-state transition, derives the frequencies expected under the independence (no-memory) model, and computes an **adjusted residual** for each cell: a standardised measure of how far the observed frequency departs from its expectation. The fitted object is therefore a *residual network* -- each edge is a tested deviation from independence rather than a raw rate. This is the named-model discipline of Dynalytics: the same sequences could be re-expressed as a probability-weighted transition (TNA) network, but that is a distinct model answering a distinct question -- the typical next state rather than the surprising one. ```{r fit} fit <- lsa(engagement) fit ``` # Read the fit Every result is produced by a verb that returns a tidy data frame; the object is never indexed directly. `transitions()` is the primary accessor, and its arguments select the transitions of interest. This step supports the most modest claim in the workflow -- which departures from chance are present in the fitted model. The evidence is descriptive and edge-level, appropriate for reading the model but not yet for claims about stability, replication, or group differences. ```{r read} transitions(fit, significant = TRUE) # transitions beyond chance ``` A positive residual means the transition happens **more** than chance (over-represented); a negative one means it is **avoided**. The other results follow the same one-verb pattern. The companion tables provide the base rates, node volumes, starting states, and tablewise tests that keep interpretation grounded in the observed process. They help prevent an edge from being read in isolation, which is part of the Dynalytics idea that estimation and interpretation must stay connected. ```{r read2} nodes(fit) # per-state incoming / outgoing volume tests(fit) # tablewise independence tests initial(fit) # where sequences start ``` # Plot - the full gallery Every view is produced by `plot(fit, type = )`, and each answers a different descriptive question. Visualisation is the first evidence tier: it can support claims about pattern, concentration, and direction, but it is not confirmatory evidence about uncertainty or group differences. ## Residual heatmap This is the default view. Rows are the current state, columns the next, and colour is the adjusted residual, so the heatmap displays the departure-from-chance model directly. Strong colours mark transitions whose observed counts sit far from the independence expectation, which makes the view well matched to descriptive claims about where the process departs from chance. ```{r heatmap, eval = has("ggplot2")} plot(fit) # type = "heatmap" ``` ## Residual network The same residuals as a directed graph. The convention matches TNA and Nestimate: **blue = more than chance** (solid edges), **red = less** (dashed, with a soft halo). It shows *which* transitions are surprising. The graph keeps the residual scale but makes direction and connectedness easier to read. It is still the RESIDUAL model: blue and red edges are not simply popular or unpopular transitions, but tested deviations from what independence would predict. ```{r net-residual, eval = has("cograph")} plot(fit, type = "network") ``` ## Transition network (a TNA model) Weighting the edges by probability instead yields the familiar transition network. `lagseq` constructs a `tna` object on the fly and renders it with tna's styling: coloured nodes, initial-probability arcs, and weighted directed edges. The named model changes even though the underlying sequences do not. A transition (TNA) network is a first-order Markov model whose edge weights are conditional probabilities; it is the appropriate display when the claim concerns the typical next state, whereas the residual network is appropriate when the claim concerns a tested departure from chance. ```{r net-transition, eval = has("tna")} plot(fit, type = "network", weights = "prob") ``` ## Chord and sunburst Two further views connect local edges to the overall movement of the process: a chord diagram of the transition flow, and a polar sunburst of each state's outgoing distribution. Both are descriptive - for example, they make clear when one state distributes its outgoing transitions broadly while another concentrates them in a single destination. They do not replace the uncertainty checks that follow; they make the fitted structure easier to inspect before stronger claims are made. ```{r chord, eval = has("cograph")} plot(fit, type = "chord") ``` ```{r sunburst, eval = has("ggplot2")} plot(fit, type = "sunburst") ``` # Verify and validate To validate the trustworthiness of an estimated edge, its uncertainty must be quantified. This is where the Dynalytics contract becomes explicit: a claim about a specific edge requires edge-level evidence, because a single observed value is weaker support than a measure of how much that value could vary. `lagseq` provides two complementary procedures, both returning a tidy object. `certainty_lsa()` is the analytic procedure. It places a Dirichlet-Multinomial posterior on each state's outgoing transitions, which gives an exact credible interval for every transition probability in closed form, without resampling. The interval states directly how precisely the data pin down each edge, and so provides the evidence appropriate to a claim about a specific transition. ```{r certainty} certainty_lsa(fit) as.data.frame(certainty_lsa(fit)) |> head(4) ``` `bootstrap_lsa()` is the resampling counterpart. It resamples whole sequences with replacement, re-estimates the model on each resample, and summarises how much each edge varies. The two procedures agree when the population is homogeneous; the bootstrap is preferred when it is a mixture, because resampling whole sequences preserves the within-sequence dependence that the analytic model treats as independent. The forest plot below shows each edge's interval. Both procedures supply the edge-level evidence a claim about a single transition requires: a narrow, stable interval indicates the transition is not an artefact of the particular sample. ```{r bootstrap} as.data.frame(bootstrap_lsa(fit, R = 200)) |> head(4) ``` ```{r forest, eval = has("ggplot2")} plot(bootstrap_lsa(fit, R = 200)) # circular forest of edge CIs ``` `reliability_lsa()` raises the claim from a single edge to the whole network. It repeatedly splits the sequences into two halves, estimates a model on each half, and correlates the two edge-weight vectors. A high average correlation indicates that the network is reproducible within the sample; a low one indicates that the structure is sample-dependent and should be interpreted with caution. This evidence matches a claim about the network as a whole rather than about any single edge. ```{r reliability} reliability_lsa(fit, R = 30) ``` # Starting from a raw event log `engagement` arrived as ready-made sequences. Real data is often a **long event log** instead - one row per event. `lsa()` sequences it on the fly with the same `actor` / `action` / `time` grammar that `tna` uses. This step shows that the same inferential chain can begin from raw process traces rather than from an already assembled sequence matrix. The data format changes, but the contract does not: define the actors, events, and ordering rule before estimating a model from them. `tna::group_regulation_long` is exactly that: 2,000 students, each a stream of timestamped regulation actions, with a recorded achievement group. The grouping variable will matter later, but it is first treated as metadata attached to recovered sequences. Keeping sequencing and grouping explicit helps preserve the units that later resampling and permutation procedures require. ```{r log, eval = has("tna")} log <- tna::group_regulation_long head(log) ``` A single call converts the log into a fitted model: `lsa()` groups events by actor, orders them by time, splits sessions on long gaps, and fits. Those operations are not clerical details; they define the empirical process to which the residual test is applied. Once the log has been converted to ordered transitions, the fitted edges again represent departures from independence in that recovered sequence process. ```{r log-fit, eval = has("tna")} fit_log <- lsa(log, actor = "Actor", action = "Action", time = "Time") fit_log ``` That grammar is shared deliberately, so the objects interoperate. The data can be prepared with tna and passed directly to `lsa()`, or a fit can be converted the other way for tna tooling. Interoperability is useful because Dynalytics separates the model from the rendering and analysis layers. The same empirical object can be read as a residual lag-sequential model when testing departures from chance, or passed into transition-network tooling when the question is about probability-weighted flow. ```{r interop, eval = has("tna")} lsa(tna::tna(tna::prepare_data(log, actor = "Actor", action = "Action", time = "Time"))) lsa_to_tna(fit_log, weights = "prob") # fit -> a tna network object ``` # Group To compare achievement groups, name the grouping **column** in the same call - `group = "Achiever"`. `lsa()` derives one label per recovered sequence (the column must be fixed within each actor), so there is no manual splitting or relabelling. Grouped fitting estimates one lag-sequential model per group under the same data grammar, which makes the models comparable. At this point the claim is still descriptive: each group has its own fitted residual structure. ```{r group, eval = has("tna")} gfit <- lsa(log, actor = "Actor", action = "Action", time = "Time", group = "Achiever") gfit ``` Every verb now returns grouped, tidy results with a leading `group` column. This makes it easy to inspect apparent differences without turning visual contrast into inference. Side-by-side networks almost always differ somewhere, so the next step asks whether those differences are larger than expected under a valid null. ```{r group-read, eval = has("tna")} transitions(gfit, significant = TRUE) |> head(4) ``` # Compare the groups `compare_lsa()` tests whether the two groups have a different transition structure. It permutes the group labels and compares an N-invariant effect size (the log-odds ratio), returning a per-edge table together with a single omnibus test of whether the groups differ at all. This is the evidence tier appropriate to a comparative claim. Under the null hypothesis of no group difference, group labels are exchangeable, so shuffling them builds a reference distribution for any chosen difference function. The observed edge differences and the omnibus statistic are then judged against that permutation distribution rather than against the visual fact that two separately estimated networks are not identical. ```{r compare, eval = has("tna")} cmp <- compare_lsa(gfit, R = 500, adjust = "BH") cmp ``` The back-to-back **barrel** displays the result: each group's bar runs to one side, the higher group's bar is bordered (darker for a larger difference), and the centre chip carries the difference p-value. The plot is a compact rendering of the permutation result, not a substitute for it: the bars show direction and magnitude, while the centre chip records whether the observed difference is large relative to the differences produced by label shuffling under exchangeability. ```{r barrel, eval = has("tna") && has("ggplot2")} plot(cmp) ``` The same comparison as a full-grid difference heatmap. Use it when the claim is easier to assess as a matrix of source-by-target contrasts than as paired bars. The underlying evidence remains the same permutation comparison; only the visual summary changes. ```{r diff-heatmap, eval = has("tna") && has("ggplot2")} plot(cmp, style = "heatmap") ``` # In short The compact recipe below is the same workflow expressed as a checklist: estimate the named model, read and visualise its descriptive structure, validate edge-level uncertainty, move to logs when needed, preserve group labels at the sequence level, and use permutation for group comparisons. Each step escalates the evidence only when the claim being made also escalates. ```r fit <- lsa(engagement) # 1. fit ready sequences transitions(fit, significant = TRUE) # 2. read (one verb per result) plot(fit) # 3. plot: heatmap / network / plot(fit, type = "network", weights = "prob") # chord / sunburst, TNA network certainty_lsa(fit); bootstrap_lsa(fit) # 4. validate (analytic + resampling) log <- tna::group_regulation_long # 5. or start from a raw log fit <- lsa(log, actor =, action =, time =) # TNA grammar, one call gfit <- lsa(log, actor =, action =, time =, group = "Achiever") # 6. group by a column compare_lsa(gfit) # 7. compare + plot(cmp) ``` Taken together, the workflow is a complete inferential chain. Descriptive claims receive descriptive support, edge claims receive edge-level uncertainty, whole-network claims receive reproducibility evidence, and group-difference claims receive permutation evidence under exchangeability. That is the Dynalytics scientific contract in practice: every interpretation is tied to the evidence tier that can actually support it.