This repository is the landing page for code and resources for the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (version >= 2.0).
This fork (NanAquarius/TAASSC) adds a modernized, pip-installable taassc Python package that runs on modern CPython (3.10–3.13) and spaCy 3.x without changing the original tagging algorithm. The historical pub_versions/ and dev_versions/ code is preserved unchanged. See MODERNIZATION.md for the full rationale and a list of behavioural notes.
TAASSC draws on at least four perspectives on the analysis of syntactic use:
- Syntactic complexity via clausal subordination (e.g., Ortega, 2003; Wolfe-Quintero et al., 1998, inter alia)
- Syntactic complexity via phrasal elaboration (e.g., Biber, 2011; Kyle & Crossley, 2018; Lu, 2010)
- Construction grammar/Usage-based theories of language development (e.g., Ellis, 2002; Ellis & Ferreira-Junior, 2009; Goldberg, 1995, 2006; Tomasello, 2003)
- Lexicogrammatical variation (Biber, 1988; Biber et al., 2004; Biber et al., 2011; Biber et al., 2014)
The original version of TAASSC (Kyle, 2016; Kyle & Crossley, 2017, 2018) used Stanford CoreNLP (Chen & Manning, 2014) and corpus data drawn from the Corpus of Contemporary American English (COCA; Davies, 2009). Beginning with TAASSC 2.0, spaCy (Explosion AI, 2020) was used for part-of-speech tagging and dependency parsing, primarily because spaCy is written in Python and some end users had difficulty installing the Java dependencies for Stanford CoreNLP. Additionally, because Mark Davies does not want frequency lists from COCA distributed publicly, TAASSC 1.x could not be truly open source. Accordingly, in TAASSC 2.x corpus data was drawn from sections of the Corpus of the Web project (COW; Schäfer, 2015; Schäfer & Bildhauer, 2012).
TAASSC 1.x: Please see information at https://www.linguisticanalysistools.org/.
TAASSC 2.0.0.58: Version used in Kyle et al. (2021) | version notes | download code
The taassc package is a packaged version of the TAASSC 2.1.x lexicogrammatical / Biber-tag engine. The
tagging logic is a faithful port of dev_versions/TAASSC 2.1.x/TAASSC_215_dev.py;
the packaging makes it installable, working-directory independent, and compatible
with current spaCy.
Use Python 3.11 or 3.12 (3.10 and 3.13 are also supported and tested). These have prebuilt binary wheels for spaCy/thinc/numpy, so a normal install needs no C/C++ compiler. Python 3.14+ is not yet in the support window because some dependencies do not yet publish wheels for it.
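Before creating the virtual environment, it can help to confirm that the interpreter falls inside the support window described above. The following is a minimal sketch, not part of the taassc package; the helper name check_python is hypothetical and the version bounds are copied from this README.

```python
import sys

# Version bounds taken from this README: 3.10-3.13 supported, 3.11/3.12 recommended.
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 13)
RECOMMENDED = ((3, 11), (3, 12))


def check_python(version=None):
    """Classify a (major, minor) pair; defaults to the running interpreter.

    Hypothetical helper for illustration only.
    """
    major_minor = tuple(version or sys.version_info[:2])[:2]
    if major_minor in RECOMMENDED:
        return "recommended"
    if MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED:
        return "supported"
    return "unsupported"


if __name__ == "__main__":
    status = check_python()
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

If the result is "unsupported", switching interpreters is likely to be easier than working around missing wheels.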
# 1. create and activate a virtual environment
python3.11 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
python -m pip install --upgrade pip setuptools wheel
# 2. install the package (from a clone of this repo)
pip install .
# ...or, for development: pip install -e ".[dev]"
# 3. install an English spaCy model (NOT bundled — see "Models" below)
python -m spacy download en_core_web_sm
python -m spacy validate

pip install . pulls binary wheels in the common case, so no build toolchain is
required. (See "Avoiding C/C++ builds" if your platform forces a source build.)
Models are not package dependencies (they are large and versioned separately).
TAASSC defaults to en_core_web_sm — the same model family used by the
published TAASSC 2.0.0.58, and fast enough for everyday use.
Install a model in either of these ways:
python -m spacy download en_core_web_sm # simplest
# or, pinned/reproducible:
pip install -r requirements-models.txt
# or, via the CLI helper:
taassc install-model en_core_web_sm

The transformer model en_core_web_trf is more accurate but slower and heavier
(it also needs pip install spacy-transformers). Select a model, in priority
order, by:
- a function argument: taassc.LGR_Analysis(text, model="en_core_web_trf");
- the environment variable TAASSC_SPACY_MODEL=en_core_web_trf;
- the default, en_core_web_sm.
Tagging output can differ between models — treat _sm and _trf results as not
directly comparable.
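The priority order for model selection can be illustrated with a small stand-alone function. This is a sketch of the documented behaviour only, not the package's actual implementation; resolve_model_name is a hypothetical name, while the environment variable and default come from this README.

```python
import os

DEFAULT_MODEL = "en_core_web_sm"   # default named in this README
ENV_VAR = "TAASSC_SPACY_MODEL"     # environment variable named in this README


def resolve_model_name(model=None, environ=None):
    """Hypothetical helper mirroring the documented priority order.

    `environ` defaults to os.environ; passing a dict makes the logic easy to test.
    """
    env = os.environ if environ is None else environ
    if model:                  # 1. explicit function argument wins
        return model
    if env.get(ENV_VAR):       # 2. then the environment variable, if set and non-empty
        return env[ENV_VAR]
    return DEFAULT_MODEL       # 3. otherwise the default


if __name__ == "__main__":
    print(resolve_model_name())
```

Logging the resolved name alongside your results is one way to keep _sm and _trf runs from being mixed up later.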
import taassc
# Analyze a string. The model loads lazily on this first call (not at import).
result = taassc.LGR_Analysis("They said she liked hamburgers. They also said that he didn't.")
print(result["nwords"], result["nn_all"], result["mattr"])
print(result["lemma_text"]) # pos-tagged lemmas
# Pretty-print / write the per-token annotation
taassc.print_vertical(result["tagged_text"])
taassc.output_vertical(result["tagged_text"], "out.tsv", ordered_output="full")
taassc.output_xml(result["tagged_text"], "out.xml")
# Summary spreadsheet for a folder (or a list of filenames) of .txt files
taassc.LGR_Full("test_files/", "results.csv")
taassc.LGR_Full("test_files/", "results.csv", output=["xml", "vertical"])
# Recalculate indices from a folder of fix-tagged XML files
import glob
taassc.lgrXml(glob.glob("xml_output/*.xml"), "xml_test.csv")

The public API (LGR_Analysis, LGR_Full, lgrXml, print_vertical,
output_vertical, output_xml, and the XML readers) keeps the same names and
call signatures as TAASSC_215_dev.py; LGR_Analysis/LGR_Full only gained
optional trailing model=/nlp= keyword arguments.
taassc analyze input.txt --output results.csv # one file -> summary CSV
taassc analyze input.txt --xml out.xml # one file -> annotated XML
taassc analyze input.txt --vertical out.tsv # one file -> vertical TSV
taassc analyze input.txt # print a short summary
taassc analyze-folder test_files/ --output results.csv
taassc analyze-folder test_files/ --output results.csv --xml --vertical
taassc --version

Add --model en_core_web_trf to any analysis command to switch models.
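For batch scripts, the folder command can also be assembled and launched from Python. The sketch below uses only the subcommand and flags shown above (analyze-folder, --output, --xml, --vertical, --model); the function names are hypothetical, and it assumes the taassc executable is on PATH when run_taassc is called.

```python
import subprocess


def build_analyze_folder_cmd(folder, output_csv, xml=False, vertical=False, model=None):
    """Build the argument list for `taassc analyze-folder` (hypothetical helper)."""
    cmd = ["taassc", "analyze-folder", folder, "--output", output_csv]
    if xml:
        cmd.append("--xml")
    if vertical:
        cmd.append("--vertical")
    if model:
        cmd += ["--model", model]
    return cmd


def run_taassc(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_taassc(...) to actually execute it.
    print(" ".join(build_analyze_folder_cmd("test_files/", "results.csv", xml=True)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with folder names that contain spaces.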
Normally pip install downloads prebuilt wheels and no compiler is needed.
You only hit a source build (and thus need build tools) when your Python version
or platform has no matching wheel for spaCy/thinc/numpy. If that happens, prefer
to switch to Python 3.11 or 3.12 rather than installing a compiler — that is
almost always the faster fix. On Windows in particular, stick to a supported
Python version to avoid needing the MSVC build tools.
The published TAASSC results were produced with older software (TAASSC 2.0.0.58,
Python 3.7.3, spaCy 2.1.8, en_core_web_sm 2.1.0). This modernized package
targets current Python/spaCy and is intended for maintenance and new
research. Because spaCy's models and parser have changed across major versions,
numbers are not guaranteed to be identical across versions — compare results
across spaCy/model versions with care, and always report the versions you used.
The bundled golden test snapshots are pinned to spaCy 3.8.x + en_core_web_sm
3.8.0. See MODERNIZATION.md for details.
pip install -e ".[dev]"
python -m spacy download en_core_web_sm
python -m spacy validate
python -m pytest

Model-dependent tests are skipped automatically if no model is installed; the byte-exact golden tests are skipped unless the installed spaCy/model versions match the snapshot versions.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
