TAASSC

This repository is the landing page for code and resources for the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (version >= 2.0).

This fork (NanAquarius/TAASSC) adds a modernized, pip-installable taassc Python package that runs on modern CPython (3.10–3.13) and spaCy 3.x without changing the original tagging algorithm. The historical pub_versions/ and dev_versions/ code is preserved unchanged. See MODERNIZATION.md for the full rationale and a list of behavioural notes.

Overview of TAASSC Indices

TAASSC draws on at least four perspectives on the analysis of syntactic use:

  • Syntactic complexity via clausal subordination (e.g., Ortega, 2003; Wolfe-Quintero et al., 1998, inter alia)
  • Syntactic complexity via phrasal elaboration (e.g., Biber, 2011; Kyle & Crossley, 2018; Lu, 2010)
  • Construction grammar/usage-based theories of language development (e.g., Ellis, 2002; Ellis & Ferreira-Junior, 2009; Goldberg, 1995, 2006; Tomasello, 2003)
  • Lexicogrammatical variation (Biber, 1988; Biber et al., 2004; Biber et al., 2011; Biber et al., 2014)

Notes on TAASSC Version 2.x

The original version of TAASSC (Kyle, 2016; Kyle & Crossley, 2017, 2018) used Stanford CoreNLP (Chen & Manning, 2014) and corpus data drawn from the Corpus of Contemporary American English (COCA; Davies, 2009). Beginning with TAASSC 2.0, spaCy (Explosion AI, 2020) was used for part-of-speech tagging and dependency parsing, primarily because spaCy is written in Python and some end users had difficulty installing the Java dependencies for Stanford CoreNLP. Additionally, because Mark Davies does not want frequency lists from COCA distributed publicly, TAASSC 1.x could not be truly open source. Accordingly, corpus data in TAASSC 2.x was drawn from sections of the Corpus of the Web project (COW; Schäfer, 2015; Schäfer & Bildhauer, 2012).

TAASSC Versions

TAASSC 1.x: Please see information at https://www.linguisticanalysistools.org/.

TAASSC 2.0.0.58: Version used in Kyle et al. (2021) | version notes | download code


The taassc package (modernized)

taassc is a packaged version of the TAASSC 2.1.x lexicogrammatical / Biber-tag engine. The tagging logic is a faithful port of dev_versions/TAASSC 2.1.x/TAASSC_215_dev.py; the packaging makes it installable, independent of the working directory, and compatible with current spaCy.

Recommended Python version

Use Python 3.11 or 3.12 (3.10 and 3.13 are also supported and tested). These have prebuilt binary wheels for spaCy/thinc/numpy, so a normal install needs no C/C++ compiler. Python 3.14+ is not yet in the support window because some dependencies do not yet publish wheels for it.
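
The support window above can be checked with a few lines of Python before installing. This is a minimal sketch; the function name classify_python and its three labels are illustrative and are not part of the taassc package.

```python
import sys

# Support window as stated in this README (3.10 through 3.13).
SUPPORTED = {(3, 10), (3, 11), (3, 12), (3, 13)}
# Versions the README recommends first.
RECOMMENDED = {(3, 11), (3, 12)}


def classify_python(version=None):
    """Return 'recommended', 'supported', or 'unsupported' for a (major, minor) pair.

    With no argument, the running interpreter's version is classified.
    """
    source = sys.version_info if version is None else version
    major_minor = tuple(source[:2])
    if major_minor in RECOMMENDED:
        return "recommended"
    if major_minor in SUPPORTED:
        return "supported"
    return "unsupported"
```

Calling classify_python() inside the virtual environment shows whether the active interpreter falls in the window; an "unsupported" result suggests a source build may be attempted during install.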

Install

# 1. create and activate a virtual environment
python3.11 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
python -m pip install --upgrade pip setuptools wheel

# 2. install the package (from a clone of this repo)
pip install .
#    ...or, for development:  pip install -e ".[dev]"

# 3. install an English spaCy model (NOT bundled — see "Models" below)
python -m spacy download en_core_web_sm
python -m spacy validate

pip install . pulls binary wheels in the common case, so no build toolchain is required. (See "Avoiding C/C++ builds" if your platform forces a source build.)

Models

Models are not package dependencies (they are large and versioned separately). TAASSC defaults to en_core_web_sm — the same model family used by the published TAASSC 2.0.0.58, and fast enough for everyday use.

Install a model in either of these ways:

python -m spacy download en_core_web_sm     # simplest
# or, pinned/reproducible:
pip install -r requirements-models.txt
# or, via the CLI helper:
taassc install-model en_core_web_sm
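
Because models are installed as ordinary Python packages, a script can check for one before running an analysis. The sketch below uses only the standard library; is_model_installed is a hypothetical helper, not a taassc function, and it does not cover models loaded from a directory path.

```python
import importlib.util


def is_model_installed(name="en_core_web_sm"):
    """Return True if a package with this name can be imported.

    Assumption: the model was installed as a pip package (which is what
    `python -m spacy download` does), so an import-spec lookup finds it.
    """
    return importlib.util.find_spec(name) is not None
```

A script might call is_model_installed() at startup and print the install command from this section when it returns False.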

Choosing en_core_web_sm vs en_core_web_trf

The transformer model en_core_web_trf is more accurate but slower and heavier (it also needs pip install spacy-transformers). Select a model, in priority order, by:

  1. a function argument — taassc.LGR_Analysis(text, model="en_core_web_trf");
  2. the environment variable TAASSC_SPACY_MODEL=en_core_web_trf;
  3. the default, en_core_web_sm.

Tagging output can differ between models — treat _sm and _trf results as not directly comparable.
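
The priority order above can be sketched as a small resolver. This is an illustration of the documented order only, not the package's internal code; resolve_model_name and its environ parameter are hypothetical names introduced here.

```python
import os

DEFAULT_MODEL = "en_core_web_sm"
ENV_VAR = "TAASSC_SPACY_MODEL"


def resolve_model_name(model=None, environ=None):
    """Pick a model name: explicit argument, then environment variable, then default."""
    # Allow a plain dict to be passed in so the logic can be exercised without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    if model:
        return model
    from_env = environ.get(ENV_VAR, "").strip()
    if from_env:
        return from_env
    return DEFAULT_MODEL
```

Treating an empty or whitespace-only environment variable as unset is a design choice made for this sketch; the package may handle that case differently.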

Usage (Python)

import taassc

# Analyze a string. The model loads lazily on this first call (not at import).
result = taassc.LGR_Analysis("They said she liked hamburgers. They also said that he didn't.")
print(result["nwords"], result["nn_all"], result["mattr"])
print(result["lemma_text"])     # pos-tagged lemmas

# Pretty-print / write the per-token annotation
taassc.print_vertical(result["tagged_text"])
taassc.output_vertical(result["tagged_text"], "out.tsv", ordered_output="full")
taassc.output_xml(result["tagged_text"], "out.xml")

# Summary spreadsheet for a folder (or a list of filenames) of .txt files
taassc.LGR_Full("test_files/", "results.csv")
taassc.LGR_Full("test_files/", "results.csv", output=["xml", "vertical"])

# Recalculate indices from a folder of fix-tagged XML files
import glob
taassc.lgrXml(glob.glob("xml_output/*.xml"), "xml_test.csv")

The public API (LGR_Analysis, LGR_Full, lgrXml, print_vertical, output_vertical, output_xml, and the XML readers) keeps the same names and call signatures as TAASSC_215_dev.py; LGR_Analysis/LGR_Full only gained optional trailing model=/nlp= keyword arguments.

Usage (command line)

taassc analyze input.txt --output results.csv      # one file -> summary CSV
taassc analyze input.txt --xml out.xml             # one file -> annotated XML
taassc analyze input.txt --vertical out.tsv        # one file -> vertical TSV
taassc analyze input.txt                           # print a short summary
taassc analyze-folder test_files/ --output results.csv
taassc analyze-folder test_files/ --output results.csv --xml --vertical
taassc --version

Add --model en_core_web_trf to any analysis command to switch models.
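
When driving the CLI from a script, it can help to build the argument list in one place. The sketch below assembles a taassc analyze command from the flags listed above; build_analyze_command is a hypothetical helper, and placing --model at the end is an assumption, since the position of the flag is not specified here.

```python
def build_analyze_command(input_path, output=None, xml=None, vertical=None, model=None):
    """Return an argv list for `taassc analyze`, using only flags shown in this README."""
    cmd = ["taassc", "analyze", str(input_path)]
    if output:
        cmd += ["--output", str(output)]      # summary CSV
    if xml:
        cmd += ["--xml", str(xml)]            # annotated XML
    if vertical:
        cmd += ["--vertical", str(vertical)]  # vertical TSV
    if model:
        cmd += ["--model", str(model)]        # switch spaCy model
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems with file names that contain spaces.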

Avoiding C/C++ builds

Normally pip install downloads prebuilt wheels and no compiler is needed. You only hit a source build (and thus need build tools) when your Python version or platform has no matching wheel for spaCy/thinc/numpy. If that happens, prefer to switch to Python 3.11 or 3.12 rather than installing a compiler — that is almost always the faster fix. On Windows in particular, stick to a supported Python version to avoid needing the MSVC build tools.

Research reproducibility

The published TAASSC results were produced with older software (TAASSC 2.0.0.58, Python 3.7.3, spaCy 2.1.8, en_core_web_sm 2.1.0). This modernized package targets current Python/spaCy and is intended for maintenance and new research. Because spaCy's models and parser have changed across major versions, numbers are not guaranteed to be identical across versions — compare results across spaCy/model versions with care, and always report the versions you used. The bundled golden test snapshots are pinned to spaCy 3.8.x + en_core_web_sm 3.8.0. See MODERNIZATION.md for details.
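
One way to follow the advice to report versions is to record them alongside each results file. The following is a minimal sketch using the standard library; collect_versions is a hypothetical helper, and the distribution names taassc, spacy, and en_core_web_sm are assumed to match what pip installed.

```python
import platform
from importlib import metadata


def collect_versions(packages=("taassc", "spacy", "en_core_web_sm")):
    """Return a dict of version strings for Python and the named distributions."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly rather than failing the whole report.
            report[name] = "not installed"
    return report
```

Writing this dict to a sidecar file next to results.csv, for example with json.dump, keeps the provenance of each run with its output.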

Tests

pip install -e ".[dev]"
python -m spacy download en_core_web_sm
python -m spacy validate
python -m pytest

Model-dependent tests are skipped automatically if no model is installed; the byte-exact golden tests are skipped unless the installed spaCy/model versions match the snapshot versions.
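
The version gate for the golden tests can be pictured as a simple comparison against the pinned snapshot versions (spaCy 3.8.x and en_core_web_sm 3.8.0). The sketch below is an illustration of that idea, not the repository's actual test code; golden_tests_applicable is a hypothetical name.

```python
# Snapshot versions as stated in this README.
PINNED_SPACY_SERIES = "3.8"      # any 3.8.x release
PINNED_MODEL_VERSION = "3.8.0"   # exact model version


def golden_tests_applicable(spacy_version, model_version):
    """Return True only when installed versions match the snapshot versions."""
    # Compare major.minor for spaCy, and the full version string for the model.
    spacy_series = ".".join(spacy_version.split(".")[:2])
    return spacy_series == PINNED_SPACY_SERIES and model_version == PINNED_MODEL_VERSION
```

In a pytest suite, a check like this would typically feed a pytest.mark.skipif condition so that mismatched environments report a skip instead of a failure.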

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
