---
title: "Parsing and normalising author names"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Parsing and normalising author names}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(bibnets)
```

## What `parse_names()` is

`parse_names()` is an **optional, standalone utility** for cleaning author name strings. It does two things:

1. Reorders names to `"First Last"` (or other styles).
2. Breaks each name into components (`first`, `last`, `particle`, `suffix`), returned as the `"parts"` attribute.

It is **not** called by any reader or network builder. bibnets matches entity labels verbatim; you opt in to normalisation by calling this function yourself.

```{r}
parse_names(c("Saqr, Mohammed", "Lopez-Pernas, Sonsoles"))
```

## The three name conventions

Bibliometric exports use three incompatible conventions. `parse_names()` recognises all three; the rule is decided per string.

| Input | Convention | Detected by |
|---|---|---|
| `"Saqr, Mohammed"` | `Last, First` | the comma |
| `"WANG Y"` | `SURNAME Initials` (Scopus/bibnets) | trailing uppercase 1–3 letter token |
| `"Mohammed Saqr"` | `First Last` | default for comma-less, non-initial |

```{r}
parse_names(c("Saqr, Mohammed", "WANG Y", "Mohammed Saqr"))
```

A comma always means `Last, First`. For comma-less strings the `surname_first` argument controls interpretation:

* `"auto"` (default) — surname-first **iff** the trailing token looks like initials (all-uppercase, 1–3 letters). This is the *bibnets-takes-precedence* bias: native bibnets/Scopus labels parse correctly with no extra arguments, and ordinary mixed-case `"First Last"` is never misread.
* `"yes"` / `TRUE` — force surname-first.
* `"no"` / `FALSE` — force given-first (comma-less returned unchanged).
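
The `"auto"` rule for comma-less strings can be approximated in a few lines of base R. This is only a sketch of the heuristic as described above, not the package's implementation: `looks_surname_first()` is a hypothetical helper, and it assumes tokens are separated by whitespace.

```{r}
# Hypothetical helper (not part of bibnets): approximates the "auto" rule.
# Assumption: tokens are whitespace-separated.
looks_surname_first <- function(x) {
  x <- trimws(x)
  has_comma  <- grepl(",", x, fixed = TRUE)        # a comma means "Last, First"
  multi      <- grepl("\\s", x)                    # needs at least two tokens
  last_token <- sub("^.*\\s", "", x)               # trailing token
  initials   <- grepl("^[[:upper:]]{1,3}$", last_token)
  !has_comma & multi & initials
}

looks_surname_first(c("WANG Y", "Mohammed Saqr", "AYALA-ROMERO JA", "MOHAMMED LI"))
```

The last input, `"MOHAMMED LI"`, is flagged as surname-first because `LI` is indistinguishable from a set of initials. A heuristic of this shape cannot avoid that misreading, which is why an explicit override exists.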

```{r}
parse_names("Wang Yong", surname_first = "yes")  # force surname-first
parse_names("WANG Y", surname_first = "no")      # force given-first
```

Particles and suffixes are handled, and detection is case-insensitive so it works on bibnets' upper-cased labels:

```{r}
parse_names(c("van der Berg, Jan", "Smith, John, Jr.",
              "DE LA CRUZ, ANA", "VAN DER BERG J"))
```

Group / corporate authors, `NA`, and empty strings are left untouched:

```{r}
parse_names(c("WHO Collaborating Group", NA, ""))
```

## Output styles: `format`

```{r}
nm <- c("Saqr, Mohammed", "van der Berg, Jan",
        "Garcia Marquez, Gabriel Jose")
data.frame(
  first_last    = parse_names(nm),
  last_initials = parse_names(nm, format = "last_initials"),
  last          = parse_names(nm, format = "last")
)
```

## The `"parts"` attribute

The parsed components ride along on every call, independent of `format`:

```{r}
x <- parse_names(c("van der Berg, Jan", "Smith, John, Jr."))
attr(x, "parts")
```

`type` is one of `"person"`, `"organization"`, `"empty"`, `"missing"`.

## Input shape: vector, not data frame

`parse_names()` works on **one flat character vector**. It is not a data-frame function.

bibnets readers store authors as a **list-column**: each paper has a variable number of authors, so the cell holds a *vector*, not a single string.

```{r}
papers <- data.frame(id = c("P1", "P2", "P3"), stringsAsFactors = FALSE)
papers$authors <- list(
  c("Saqr, Mohammed", "Lopez, Ana"),
  c("SAQR M", "Lopez, Ana"),
  c("Saqr, Mohammed", "Chen, Wei"))
papers$authors
```

Map the function over the list-column with `lapply()`:

```{r}
papers$authors <- lapply(papers$authors, parse_names, format = "last_initials")
papers$authors
```

A flat character column (or a network's `from` / `to`) is called directly, no `lapply()`:

```{r}
parse_names(c("WANG Y", "AYALA-ROMERO JA"))
```

## Recommended workflow: normalise *before* building

Node identity in bibnets is fixed when the network is built (labels are upper-cased and matched verbatim).
Two spellings of one author merge into a single node **only if normalised before** `author_network()`.

Here `"Saqr, Mohammed"` and `"SAQR M"` are the same person written two ways. After normalising they both become `SAQR M.`, so the Saqr–Lopez collaboration is correctly counted as **2**:

```{r}
net <- author_network(papers, type = "collaboration")
net
```

Had we built the network first and called `parse_names()` on `from` / `to` afterwards, the two spellings would already have been counted as two separate nodes — too late to merge by relabelling.

## Applying to an existing edgelist (and its hazards)

The network object is a data frame (`from`, `to`, `weight`, `count`) with an extra `bibnets_network` class for printing:

```{r}
class(net)
is.data.frame(net)
```

You *can* relabel `from` / `to` directly, but `parse_names()` is graph-blind. Edges, pairing, `weight` and `count` are preserved, but:

* Apply the **same call to both** endpoint columns, or the two ends use different labels.
* The mapping is **many-to-one**: distinct authors can collapse onto one label (especially `"last_initials"`), and bibnets does **not** re-aggregate the resulting duplicate edges.

```{r}
net$from <- as.vector(parse_names(net$from, format = "last"))
net$to   <- as.vector(parse_names(net$to,   format = "last"))
net
```

Use `as.vector()` when assigning back so the `"parts"` attribute is not carried on the column.

## Limitations

* Comma-less names are inherently ambiguous. The `auto` heuristic is biased toward the bibnets/Scopus surname-first convention and may misread uppercase `"GIVEN SURNAME"` when the surname is 1–3 letters (e.g. `"MOHAMMED LI"`). Pass `surname_first = "no"` to override.
* Suffix-first malformed input (`"Jr., Sammy Davis"`) is not specially handled.
* It normalises *string form*, not identity: it will not disambiguate two different people who share a surname and initial.

## Summary

* `parse_names(x)` — vector in, vector out, with a `"parts"` attribute.
* `lapply(df$authors, parse_names)` — for the authors list-column.
* Normalise **before** `author_network()` for correct node merging.
* `format` = `"first_last"` / `"last_initials"` / `"last"`; `surname_first` = `"auto"` / `"yes"` / `"no"`.
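
Returning to the edgelist hazard noted earlier: relabelling `from` / `to` can leave duplicate edges that bibnets does not re-aggregate. If you do relabel after building, one way to merge them by hand is with base R's `aggregate()`. The sketch below uses an invented toy edgelist in place of a real network object, and it assumes that the graph is undirected and that `weight` and `count` are additive. Neither assumption is confirmed here, so check both against your network type before relying on the result.

```{r}
# Toy edgelist standing in for a relabelled network; all values are invented.
edges <- data.frame(
  from   = c("SAQR", "LOPEZ", "CHEN"),
  to     = c("LOPEZ", "SAQR", "SAQR"),
  weight = c(1, 1, 1),
  count  = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Put each undirected pair in a canonical order so A-B and B-A match.
a <- pmin(edges$from, edges$to)
b <- pmax(edges$from, edges$to)
edges$from <- a
edges$to   <- b

# Drop self-loops created when two co-authors collapse onto one label.
edges <- edges[edges$from != edges$to, ]

# Sum weight and count within each pair (assumes both are additive).
merged <- aggregate(cbind(weight, count) ~ from + to, data = edges, FUN = sum)
merged
```

The result is a plain data frame. Normalising before `author_network()` remains the simpler route, because the builder then does the counting itself.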