
MyWebIntelligence/mwiR


COMPONENT REPOSITORY (R ANALYSIS BRIDGE). MAIN TOOL AND DATA PIPELINE: https://github.com/MyWebIntelligence/mwi

mwiR (R package for My Web Intelligence)

Status: stable as an analysis bridge (imports/analysis). Not the entry point.

If you are looking for installation and end-to-end usage, start here: https://github.com/MyWebIntelligence/mwi

Purpose of My Web Intelligence

Context and Objectives

My Web Intelligence (MWI) is a project designed to meet the growing need for tools and methodologies in the field of digital methods in social sciences and information and communication sciences (ICS). The main objective is to map the digital ecosystem to identify key actors, assess their influence, and analyze their discourses and interactions. This project addresses the increasing centrality of digital information and interactions in various fields, including health, politics, culture, and beyond.

About the Author


Amar LAKEL

Amar Lakel is a researcher in information and communication sciences, specializing in digital methods applied to social studies. He is currently a member of the MICA laboratory (Mediation, Information, Communication, Arts) at the University of Bordeaux Montaigne. His work focuses on the analysis of online discourse, mapping digital ecosystems, and the impact of digital technologies on social and cultural practices.


Methodology

Research Protocol

The research protocol of MWI relies on a combination of quantitative and qualitative methods:

  1. Data Extraction and Archiving: Using crawl technologies to collect data from the web.
  2. Data Qualification and Annotation: Applying algorithms to analyze, classify, and annotate the data.
  3. Data Visualization: Developing dashboards and relational maps to interpret the results.

Methodological Challenges

The MWI project utilizes techniques from the sociology of controversies, social network analysis, and text mining methods to:

  • Analyze the strategic positions of speakers in a heterogeneous and complex digital corpus.
  • Identify and understand the dynamics of online discourses.
  • Map the relationships between different actors and their respective influences.

Case Studies

Diverse Cases

  1. Health Information
  • Asthma and Diabetes in Children: Studies of online discourses related to these diseases to identify influential actors, understand their positions, and evaluate their impact on patients’ perceptions and behaviors. Source
  2. Online Political Controversy
  • Juan Branco Project: Analysis of discourses and influence surrounding the public figure Juan Branco, exploring the dynamics of positioning and controversy. Source
  3. Research Sociology
  • Digital Humanities: Studies on the impact of digital technologies on humanities and social sciences, including how researchers use the web to disseminate and discuss their work. Source

Results and Impact

The results of these studies show that online discourses play a crucial role in shaping opinions and behaviors in various fields. They also highlight the importance for researchers and professionals to actively engage in these discussions to promote reliable and scientifically validated information.

Repositories and Documentation

NAKALA Repositories

The data and results of the MWI project are deposited on the NAKALA platform, providing open access for other researchers and practitioners. Here are some important repositories:

  1. The collection: Contains a detailed description of the project, methodology, and results.
  2. Positions and Influences on the Web: The Case of Health Information: Detailed analysis of discourses on childhood asthma.
  3. French Digital Humanities communities: A study case on French digital humanities development on the web.

Development of the R Package

The R package developed within the framework of My Web Intelligence is designed to:

  • Facilitate the replication of analyses conducted in the project.
  • Enable the extension of developed methods and tools for other research.
  • Provide researchers and professionals with a powerful tool to understand and manage the dynamics of online information.

Main Features

  • Project Management: Tools to initiate and manage web exploration projects.
  • Data Extraction: Functions to crawl the web and extract data corpora.
  • Analysis and Annotation: Algorithms to analyze and annotate extracted data.
  • Visualization: Dashboards and maps to visualize relationships between actors and discourses.
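
Taken together, these features correspond to a handful of functions. Below is a minimal sketch of a typical session, using only the calls documented in the step-by-step guide later in this README (the example URL and term list are illustrative):

# Sketch of a typical mwiR session (details in the guide below)
initmwi()                                   # load packages, set up Python/trafilatura
db_setup()                                  # create the SQLite database (mwi.db)
create_land("AIWork", desc = "Impact of AI on work", lang = "en")
addterm("AIWork", "AI, artificial intelligence, work, employment")
addurl("AIWork", urls = "https://example.org/article")   # illustrative URL
crawlurls("AIWork", limit = 10)             # crawl and score pages
export_land("AIWork", "pagecsv", 3)         # export pages with relevance >= 3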

Conclusion

My Web Intelligence is an integrative project aimed at transforming how we understand and analyze digital information across various fields in social sciences and ICS. By combining innovative methodologies and advanced technological tools, MWI offers new perspectives on digital dynamics and proposes solutions to better understand online interactions and discourses. The R package developed from this project is an essential tool for researchers and practitioners, enabling them to fully exploit web data for in-depth and relevant analyses.

Using mwiR: A Step-by-Step Guide

Prerequisites (Before Installing mwiR)

Before installing mwiR, you need to set up your system with R and Python. Follow the instructions for your operating system.

1. Install R (Required)

Windows:

  1. Go to https://cran.r-project.org/bin/windows/base/
  2. Download the latest R installer (e.g., R-4.4.1-win.exe)
  3. Run the installer, accept all defaults
  4. Verify: Open Command Prompt and type R --version

macOS:

  1. Go to https://cran.r-project.org/bin/macosx/
  2. Download the .pkg file for your Mac (Intel or Apple Silicon)
  3. Double-click to install
  4. Verify: Open Terminal and type R --version

Linux (Ubuntu/Debian):

sudo apt update
sudo apt install r-base r-base-dev

2. Install RStudio (Recommended)

RStudio provides a user-friendly interface for R.

  1. Go to https://posit.co/download/rstudio-desktop/
  2. Download the installer for your OS
  3. Install and launch RStudio

3. Install Python 3 (Required for Web Crawling)

mwiR uses Python's trafilatura library for web content extraction. Python 3.8 or higher is required.

Windows (Important - Read Carefully):

  1. Go to https://www.python.org/downloads/windows/
  2. Download "Windows installer (64-bit)" for Python 3.11 or 3.12
  3. CRITICAL: During installation, check ✅ "Add Python to PATH"
  4. Click "Install Now"
  5. Verify installation:
    • Open Command Prompt (cmd)
    • Type: python --version → Should show Python 3.x.x
    • Type: pip --version → Should show pip version

If the python command is not found on Windows:

  • The command might be python3 or py instead
  • Or reinstall Python with "Add to PATH" checked

macOS:

# Check if Python 3 is installed
python3 --version

# If not installed, use Homebrew:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install python3

# Verify
python3 --version
pip3 --version

Linux (Ubuntu/Debian):

sudo apt update
sudo apt install python3 python3-pip python3-venv
python3 --version

4. Install Git (Required for Package Installation)

Windows:

  1. Go to https://git-scm.com/download/win
  2. Download and run the installer
  3. Accept all defaults (important: keep "Git from command line" selected)
  4. Verify: Open Command Prompt and type git --version

macOS:

# Git is usually pre-installed. If not:
xcode-select --install
# Or with Homebrew:
brew install git

Linux:

sudo apt install git

5. Verify Your Setup

Open R or RStudio and run these commands to verify everything is ready:

# Check R version (should be >= 4.0)
R.version.string

# Check if Python is accessible from R
Sys.which("python3")  # macOS/Linux
Sys.which("python")   # Windows

Installation of mwiR

Once prerequisites are installed, open RStudio (or R console) and run:

# Step 1: Install the remotes package (if not already installed)
install.packages("remotes")

# Step 2: Install mwiR from GitHub
## Temporarily disable the GitHub PAT for this session
Sys.unsetenv("GITHUB_PAT")
remotes::install_github("MyWebIntelligence/mwiR")

# Step 3: Load the package
library(mwiR)

Alternative installation method (if GitHub access issues):

# Using install_git with full URL
remotes::install_git("https://github.com/MyWebIntelligence/mwiR.git")

Note: The installation may take a few minutes as it downloads and installs all dependencies.


Python/Trafilatura Setup (Automatic)

Trafilatura is the Python library used for web content extraction. mwiR handles this automatically - you don't need to install it manually.

When you run initmwi() for the first time:

  1. mwiR creates a dedicated Python virtual environment (isolated from your system)
  2. Trafilatura is automatically installed in this environment
  3. The setup is cached, so subsequent sessions start instantly
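
In practice, a first session therefore looks like this (the Python setup only happens once):

library(mwiR)
initmwi()              # first run: creates the virtual environment and installs trafilatura
check_python_status()  # confirm Python and trafilatura are ready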

Troubleshooting Python Setup

# Check current Python/trafilatura status
check_python_status()

# If problems, force reinstall
setup_python(force = TRUE)

# Complete reset (if all else fails)
remove_python_env()
setup_python()

Common Issues and Solutions

Problem                           Solution
"Python not found"                Verify Python is in PATH (see prerequisites)
"pip not found"                   Reinstall Python with pip included
"Permission denied" (Windows)     Run RStudio as Administrator
"venv module not found" (Linux)   sudo apt install python3-venv
trafilatura errors                Run setup_python(force = TRUE)

API Keys Setup (Optional)

mwiR can integrate with external services that require API keys:

SerpAPI (for Google/Bing/DuckDuckGo search)

  1. Create an account at https://serpapi.com/
  2. Get your API key from the dashboard
  3. In R, when prompted or set manually:

Option 1: Set temporarily in R session

Sys.setenv(SERPAPI_KEY = "your_api_key_here")

This works but you'll need to run it each time you start R.

Option 2: Save permanently in .Renviron file (Recommended)

The .Renviron file stores environment variables that R loads automatically at startup.

How to create/edit .Renviron:

Method A - Using R (easiest):

# This opens the file in your default editor
usethis::edit_r_environ()

# Or create it manually:
file.edit("~/.Renviron")

Method B - Manual creation:

OS        File location
Windows   C:\Users\YourUsername\Documents\.Renviron
macOS     /Users/YourUsername/.Renviron
Linux     /home/YourUsername/.Renviron

Windows step-by-step:

  1. Open File Explorer
  2. Go to C:\Users\YourUsername\Documents\
  3. Create a new text file
  4. Rename it to .Renviron (no extension, just .Renviron)
  5. If Windows complains about no extension, confirm "Yes"
  6. Open with Notepad and add your keys

macOS/Linux step-by-step:

# Open Terminal and run:
nano ~/.Renviron
# Add your keys, then Ctrl+O to save, Ctrl+X to exit

Content of .Renviron file:

SERPAPI_KEY=your_serpapi_key_here
OPENAI_API_KEY=your_openai_key_here
OPENROUTER_API_KEY=your_openrouter_key_here

Important rules for .Renviron:

  • One variable per line
  • No spaces around =
  • No quotes around values
  • Add an empty line at the end of the file
  • Restart R/RStudio after editing for changes to take effect

Verify your keys are loaded:

# After restarting R:
Sys.getenv("SERPAPI_KEY")      # Should show your key
Sys.getenv("OPENAI_API_KEY")   # Should show your key

OpenAI/OpenRouter (for AI-assisted recoding)

Add these to your .Renviron file (see instructions above):

OPENAI_API_KEY=sk-your-openai-key
OPENROUTER_API_KEY=sk-or-your-openrouter-key

Note: If you don't have API keys yet, you can skip this step. You can still use mwiR's crawling and analysis features without them.

Step 1: Creating the Research Project

In this step-by-step guide, we will walk through the initial setup and execution of a research project using the My Web Intelligence (MWI) method. This method allows researchers to study topics such as the impact of AI on work by collecting and organizing web data. Here is a breakdown of the R script provided:

1. Load the Required Packages

initmwi()

The initmwi() function initializes the My Web Intelligence environment by loading all necessary packages and setting up the environment for further operations. This function ensures that all dependencies and configurations are correctly initialized.

2. Set Up the Database

db_setup()

The db_setup() function sets up the database needed for storing and managing the data collected during the research project. It initializes the necessary database schema and ensures that the database is ready for data insertion and retrieval.

  • db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".

3. Create a Research Project (Land)

create_land(name = "AIWork", desc = "Impact of AI on work", lang="en")

The create_land() function creates a new research project, referred to as a “land” in MWI terminology. This land will serve as the container for all data and analyses related to the project.

  • name: A string specifying the name of the land.
  • desc: A string providing a description of the land.
  • lang: A string specifying the language of the land. Default is "en".
  • db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".

4. Add Search Terms

addterm("AIWork", "AI, artificial intelligence, work, employment, job, profession, labor market")

The addterm() function adds search terms to the project. These terms will be used to crawl and collect relevant web data.

  • land_name: A string specifying the name of the land.
  • terms: A comma-separated string of terms to add.

5. Verify the Project Creation

listlands("AIWork")

The listlands() function lists all lands or projects that have been created. By specifying the project name “AIWork”, it verifies that the project has been successfully created.

  • land_name: A string specifying the name of the land to list. If NULL, all lands are listed. Default is NULL.
  • db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".

6. Add URLs Manually or Using a File

addurl("AIWork", urls = "https://www.fr.adp.com/rhinfo/articles/2022/11/la-disparition-de-certains-metiers-est-elle-a-craindre.aspx")

The addurl() function adds URLs to the project. These URLs point to web pages that contain relevant information for the research.

  • land_name: A string specifying the name of the land.
  • urls: A comma-separated string of URLs to add. Default is NULL.
  • path: A string specifying the path to a file containing URLs. Default is NULL.
  • db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".

Alternatively, URLs can be added using a text file:

# If using a text file

addurl("AIWork", path = "_ai_or_artificial_intelligence___work_or_employment_or_job_or_profession_or_labor_market01.txt")
  • path: The path to a text file containing the URLs to be added.

7. List the Projects or a Specific Project

listlands("AIWork")

This function is used again to list the projects or a specific project, ensuring that the URLs have been added correctly to “AIWork”.

8. Optionally Delete a Project

deleteland(land_name = "AIWork")

The deleteland() function deletes a specified project. This can be useful for cleaning up after the research is completed or if a project needs to be restarted.

  • land_name: A string specifying the name of the land to delete.
  • maxrel: An integer specifying the maximum relevance for expressions to delete. Default is NULL.
  • db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".

This script demonstrates the basic setup and execution of a research project using My Web Intelligence, including project creation, term addition, URL management, and project verification.

Step 2: Crawling

In this section, we will walk through the process of crawling URLs and extracting content for analysis using the My Web Intelligence (MWI) method. The following R code snippets demonstrate how to perform these tasks.

Crawl URLs for a Specific Land

crawlurls("AIWork", limit = 10)

The crawlurls() function crawls URLs for a specific land, updates the database, and calculates relevance scores.

  • land_name: A character string representing the name of the land.
  • urlmax: An integer specifying the maximum number of URLs to be processed (default is 50).
  • limit: An optional integer specifying the limit on the number of URLs to crawl.
  • http_status: An optional character string specifying the HTTP status to filter URLs.
  • db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".

Example:

This example demonstrates crawling up to 10 URLs for the land named “AIWork”.

crawlurls("AIWork", limit = 10)

Crawl Domains

crawlDomain(1000)

The crawlDomain() function crawls domains and updates the Domain table with the fetched data.

  • nburl: An integer specifying the number of URLs to be crawled (default is 100).
  • db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".

Example:

This example demonstrates crawling 1000 URLs and updating the Domain table.

crawlDomain(1000)

Step 3: Export Files and Corpora

In this section, we will walk through the process of exporting data and corpora from a research project using the My Web Intelligence (MWI) method. The following R code snippets demonstrate how to perform these tasks.

Export Land Data

The export_land() function exports your research data in various formats.

Parameters:

  • land_name: Name of the land/project
  • export_type: Format to export (see below)
  • minimum_relevance: Minimum relevance score (default: 1)
  • labase: Database file (default: "mwi.db")
  • ext: For corpus export only - "md", "txt", or "pdf" (default: "md")

Available export types:

Type          Description                          Output
pagecsv       Basic page data                      CSV file
fullpagecsv   Complete page data with content      CSV file
nodecsv       Domain nodes for network analysis    CSV file
nodegexf      Domain network graph                 GEXF file (Gephi)
pagegexf      Page-level network graph             GEXF file (Gephi)
mediacsv      Media/images extracted               CSV file
corpus        Text corpus with metadata            ZIP of md/txt/pdf files

Examples:

# Export page data as CSV (minimum relevance = 3)
export_land("AIWork", "pagecsv", 3)

# Export full content as CSV
export_land("AIWork", "fullpagecsv", 2)

# Export domain network for Gephi
export_land("AIWork", "nodegexf", 3)

# Export page network for Gephi
export_land("AIWork", "pagegexf", 3)

# Export text corpus as Markdown files
export_land("AIWork", "corpus", 3, ext = "md")

# Export text corpus as PDF files
export_land("AIWork", "corpus", 3, ext = "pdf")

Step 4: Enrich Your Corpus with SerpAPI Helpers

Once the foundational land is in place, the next objective is to broaden your web perimeter. The package provides dedicated helpers around SerpAPI so you can script keyword expansion and SERP harvesting before every crawl.

1. Discover Related Queries

related_query("intelligence artificielle", lang = "fr", country = "France")

related_query() returns the "People also search for" block as a tidy data frame. Typical workflow: collect the suggestions, inspect them quickly in R, fold the most relevant ones back into addterm(), and archive the CSV for methodological transparency.

Common language codes (lang parameter):

Code     Language                   Code     Language
en       English                    it       Italian
fr       French                     pt       Portuguese
de       German                     pt-br    Portuguese (Brazil)
es       Spanish                    nl       Dutch
es-419   Spanish (Latin America)    pl       Polish
ar       Arabic                     ru       Russian
zh-cn    Chinese (Simplified)       ja       Japanese
zh-tw    Chinese (Traditional)      ko       Korean

Common country values (country parameter):

France          Germany           Spain
United States   United Kingdom    Canada
Belgium         Switzerland       Italy
Brazil          Mexico            Argentina
Japan           Australia         Netherlands

Usage examples:

# French search in France
related_query("intelligence artificielle", lang = "fr", country = "France")

# English search in United States
related_query("artificial intelligence", lang = "en", country = "United States")

# Spanish search in Spain
related_query("inteligencia artificial", lang = "es", country = "Spain")

# German search in Germany
related_query("künstliche Intelligenz", lang = "de", country = "Germany")

# Portuguese search in Brazil
related_query("inteligência artificial", lang = "pt-br", country = "Brazil")

Full list available at: SerpAPI Languages and SerpAPI Countries

2. Capture Google, DuckDuckGo, and Bing Result Lists

urlist_Google(
  query = "ai OR artificial intelligence",
  datestart = "2024-01-01",
  dateend   = "2024-03-31",
  timestep  = "month",
  sleep_seconds = 2,
  lang = "en"
)

urlist_Google(), urlist_Duck(), and urlist_Bing() paginate SERP responses and write raw URL dumps to disk (one file per query). You can then read those files back with importFile() and feed them to addurl(). Remember to space out requests (sleep_seconds) to stay within rate limits.
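
That round trip might look like the sketch below, reusing importFile() and addurl() as in Step 9 (how the dump file is selected depends on importFile(); the url column name follows the Step 9 example):

# Read a SERP dump back into R, then feed the URLs to the land
serp_urls <- importFile()        # read the dump file written by urlist_Google()
addurl("AIWork", urls = serp_urls$url)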

3. Monitor SEO Signals

mwir_seorank(
  filename = "aiwatch_seo",
  urls     = c("example.com", "opencorpus.org"),
  api_key  = Sys.getenv("SEORANK_API_KEY")
)

mwir_seorank() queries the SEO Rank API for MOZ/PageSpeed style indicators. Because the function appends rows as soon as a response arrives, you can launch it overnight on dozens of domains and obtain a ready-to-share CSV.

Step 5: Transform and Diagnose Numeric Variables

When the time comes to model or discretise quantitative indicators (e.g., in-degree, frequency, sentiment scores), the package offers statistical helpers inspired by social-science workflows.

1. Explore Transformations Visually

plotlog(
  df         = analytics,
  variables  = c("in_degree", "reach"),
  trans_type = c(in_degree = "log1p", reach = "zscore"),
  save       = TRUE
)

plotlog() overlays the original and transformed histograms so you can compare scales immediately. Main arguments and expected inputs:

  • df — data frame passed to the function. If you leave variables = NULL, every numeric column in df is analysed.
  • variables — character vector that specifies the columns to plot. You can supply a named vector or list so each variable receives its own transformation rule.
  • trans_type — transformation applied to each series. Recognised keywords: "none", "log", "log1p", "sqrt", "rank", "zscore". Provide a single value to reuse it everywhere, a named vector/list to mix them, or a custom function returning a numeric vector.
  • bins — histogram resolution. Accepts an integer (e.g. 30) or one of the standard rules: "sturges" (default), "fd"/"freedman-diaconis", "scott", "sqrt", "rice", "doane", "auto" (maximum of Sturges and F-D).
  • colors, alpha — choose the two fill colours (original vs transformed) and set the transparency level between 0 and 1.
  • theme — any ggplot2 theme object (theme_minimal() by default).
  • density, show_rug — booleans that toggle a kernel density overlay and a rug showing individual points.
  • na_rm, min_non_missing — control filtering. na_rm = TRUE drops non-finite values before plotting; min_non_missing (default 5) is the minimum number of finite values required for a variable to be plotted.
  • shift_constant — positive offset automatically added before log/sqrt transformations when the data contains zero or negative values (default 1).
  • display — TRUE prints the combined panel to the current graphics device; set to FALSE to return the object silently.
  • save — set to TRUE to export the plots. Use with save_dir (folder), save_format ("png", "pdf", or "jpg"), save_dpi, device_width, and device_height to control the files that are written.
  • verbose — produces progress messages when TRUE (defaults to interactive()).
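
For instance, a closer look at a single indicator could combine the Freedman-Diaconis binning rule with a density overlay (a sketch assuming the same analytics data frame as above):

plotlog(
  df         = analytics,
  variables  = "reach",
  trans_type = "log1p",
  bins       = "fd",        # Freedman-Diaconis rule
  density    = TRUE,        # add a kernel density overlay
  show_rug   = TRUE         # mark individual observations
)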

2. Apply Robust Transformations Programmatically

scaled <- transform_variable(
  x         = analytics$reach,
  method    = "yeojohnson",
  winsorize = 0.01
)

transform_variable() stores both the transformed values and the inverse mapping. This makes it easy to export model-ready columns while keeping de-standardisation metadata.

  • x — numeric vector to transform (NA/Inf allowed; non-finite entries propagate).
  • method — transformation choice: "auto" (bestNormalize search), "none", "center", "zscore", "robust_z", "log", "log1p", "sqrt", "boxcox", "yeojohnson", "ranknorm", or a user-supplied function.
  • winsorize — optional share of tails to trim before transforming (0 ≤ value < 0.5). Use NULL to skip.
  • shift_constant — positive constant automatically added before log/sqrt transforms when x contains non-positive values (default 1).
  • handle_na — choose "keep" (default) to leave NA in place or "omit" to drop them before fitting the transform.
  • ... — forwarded to bestNormalize helpers (e.g. Box-Cox tweaks) when the selected method requires it.

3. Segment Indicators into Meaningful Classes

clusters <- find_clusters(
  analytics$reach,
  max_G         = 5,
  transform     = "auto",
  winsorize     = 0.01,
  return_breaks = TRUE
)

classes <- discretize_variable(
  analytics$reach,
  method = "manual",
  breaks = clusters$breaks,
  labels = c("Faible", "Moyen", "Élevé")
)
  • find_clusters() fits one-dimensional Gaussian mixtures to reveal typologies. Key parameters: max_G (number of components), criterion ("bic" or "icl"), transform ("none", "log1p", "yeojohnson", "zscore", "auto"), and winsorize (0–0.5). With return_breaks = TRUE, the function returns ready-to-use break points and exposes the classification, the posterior probabilities, n_clusters, and several diagnostics.
  • discretize_variable() then turns the continuous measure into interpretable classes. The available methods ("equal_freq", "equal_width", "quantile", "jenks", "kmeans", "gmm", "manual") cover most scenarios. In "manual" mode, supply your own breaks (for example those from the clustering) and meaningful labels. The returned factor is ordered and carries a discretize_meta attribute (breaks, counts, warnings).

4. Examine Heavy-Tailed Behaviours

powerlaw <- analyse_powerlaw(
  analytics$reach,
  type             = "discrete",
  candidate_models = c("powerlaw", "lognormal", "exponential"),
  bootstrap_sims   = 200,
  winsorize        = 0.01,
  threads          = NULL  # Auto-detection (recommended)
)
  • analyse_powerlaw() compares several tail distributions to test whether a genuine power law is present.
  • In "discrete" mode, the data are rounded and values < 1 are excluded; make sure enough positive observations remain (min_n = 50 by default).
  • Adjust type, candidate_models, winsorize, xmin, and threads to balance robustness against computation time.
  • Automatic parallelisation: threads = NULL (the default) detects the optimal number of cores automatically:
    • On Apple Silicon (M1/M2/M3), only the Performance cores (P-cores) are used, avoiding the slower E-cores
    • On other systems, half of the available cores minus one are used
    • Use mwir_system_info() to see the detected configuration
  • candidate_models accepts "powerlaw", "lognormal", "exponential" (and "weibull" for continuous data). You can supply a targeted subset or reorder the list to match the distributions relevant to your field.
  • bootstrap_sims controls the number of KS simulations; bootstrap_models restricts the list of simulated models. Lower bootstrap_sims for a quick result, raise it for more precision.
  • The output gathers best_model, the fitted parameters (best_fit), the likelihood comparisons (comparisons), the bootstrap diagnostics (bootstrap), and a data_summary ready to use in reports.
  • Good practice: try several winsorize values, monitor best_fit$n_tail, examine the bootstrap p-values, and justify the chosen xmin.
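
A quick way to review the result, assuming the output components listed above are stored as list elements:

powerlaw$best_model      # name of the best-fitting tail distribution
powerlaw$best_fit        # fitted parameters; check best_fit$n_tail
powerlaw$comparisons     # likelihood comparisons between candidate models
powerlaw$bootstrap       # KS bootstrap diagnostics (p-values)
powerlaw$data_summary    # summary table for reports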

5. System Information for Parallelization

# Display system info and recommended worker count
mwir_system_info()

# Example output on Apple Silicon Mac:
# === mwiR System Info ===
# OS: Darwin
# Architecture: arm64
# Total cores: 10
# Performance cores: 8
# Efficiency cores: 2
# Recommended workers: 7
# Apple Silicon: Yes

mwir_system_info() displays hardware detection results and the recommended number of parallel workers. Useful for:

  • Verifying Apple Silicon detection
  • Understanding why threads = NULL chose a specific worker count
  • Troubleshooting parallelization issues

Step 8: Leverage AI Assistance Responsibly

Large Language Models can speed up qualitative coding, but they demand guardrails. The unified LLM_Recode() function supports multiple providers (OpenAI, OpenRouter, Anthropic, Ollama) with a simple, consistent interface.

Configuration (once per session)

# Configure your preferred provider
LLM_Config(provider = "openai", lang = "fr")

# Or set API keys directly
Sys.setenv(OPENAI_API_KEY = "sk-...")
Sys.setenv(OPENROUTER_API_KEY = "orpk-...")
Sys.setenv(ANTHROPIC_API_KEY = "sk-ant-...")

Basic Usage with Vectors

# Simple translation of a vector
translations <- LLM_Recode(
  data        = c("Hello world", "Automation and labour markets"),
  prompt      = "Traduire en français: {value}",
  temperature = 0.4
)

Advanced Usage with Data Frames

# Process multiple columns using glue templates
results <- LLM_Recode(
  data   = my_dataframe,
  prompt = "Résumer en 20 mots: {title} - {content}",
  provider        = "openrouter",
  model           = "openrouter/auto",
  temperature     = 0.2,
  max_tokens      = 80,
  return_metadata = TRUE,
  parallel        = TRUE,       # Enable parallel processing
  workers         = 4
)

# Check failures
failed <- results[results$status != "ok", ]

Inputs to LLM_Recode()

  • data: vector or data.frame to process (required).
  • prompt: glue template with {variable} placeholders (required). For vectors, use {value}.
  • provider: "openai", "openrouter", "anthropic", or "ollama" (auto-detected if omitted).
  • model: model identifier (default depends on the provider).
  • temperature (0–2): generation randomness.
  • max_tokens: cap on output tokens.
  • max_retries, retry_delay, backoff_multiplier: retry strategy with exponential backoff.
  • rate_limit_delay: delay between requests to respect quotas.
  • parallel, workers: parallel processing via future/furrr.
  • return_metadata: returns a data.frame with status, attempts, and tokens used.
  • on_error: "continue" (default) or "stop" at the first error.

Output

  • By default, a vector of recoded strings (NA on failure).
  • With return_metadata = TRUE, a data.frame with columns value, status, attempts, tokens_used, model_used, error_message.

LLM_Config() - Session Configuration

LLM_Config(
  provider = "openai",    # Default provider
  model    = "gpt-4o",    # Default model
  lang     = "fr",        # Message language (fr/en)
  verbose  = TRUE         # Show progress messages
)

Best Practices

  • Document the prompt, sysprompt, model, and version in your lab notebook.
  • Lower temperature for faithful translations; raise it for creative rephrasing.
  • Limit max_tokens to keep responses concise.
  • Use return_metadata = TRUE to audit the results.
  • Enable parallel = TRUE sparingly (respect API rate limits).

Step 9: Maintain the Database Throughout the Project Lifecycle

The database layer underpins every land. The following helpers keep it healthy and synchronised with external edits.

1. Connect Programmatically and Reuse IDs

con      <- connect_db()
land_id  <- get_land_id(con, "AIWork")
domaines <- list_domain(con, land_name = "AIWork")
  • connect_db() returns a ready-to-use DBI connection.
  • get_land_id() converts human-readable land names into numeric IDs when you automate workflows.
  • list_domain() produces a domain summary (counts, keywords) to monitor coverage.
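
Because connect_db() returns a standard DBI connection, close it when you are finished; this is plain DBI usage, not an mwiR-specific helper:

# Release the database connection at the end of the session
DBI::dbDisconnect(con)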

2. Import Additional Material

urls <- importFile()
addurl("AIWork", urls = urls$url)

Use importFile() whenever you enrich your corpus from spreadsheets or open postings. The helper returns a data frame; pass the relevant column to addurl().

3. Reinstate Externally Annotated Data

annotatedData(
  dataplus = curated_notes,
  table    = "Expression",
  champ    = "description",
  by       = "id"
)

annotatedData() wraps transactional updates so a batch edit either fully succeeds or rolls back. Always back up mwi.db before bulk reinsertion.

4. Export Precisely What You Need

Beyond export_land(), the family of dedicated exporters gives you fine-grained control:

  • export_pagecsv() and export_fullpagecsv() to share tabular corpora.
  • export_nodecsv() / export_nodegexf() for network analysis.
  • export_mediacsv() to audit associated media.
  • export_pagegexf() for expression-level graphs.
  • export_corpus() to assemble text files plus metadata headers (ideal for CAQDAS tools). Supports multiple formats: ext = "md" (Markdown, default), ext = "txt" (plain text), ext = "pdf" (PDF, requires rmarkdown + pandoc).

Each exporter accepts minimum_relevance, so you can balance breadth and focus depending on the audience.
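
For example, a Markdown corpus export for CAQDAS import might look like the sketch below; the argument names are assumed to mirror export_land() and should be checked against the function's own help page:

# Sketch only: argument names assumed to mirror export_land()
export_corpus("AIWork", minimum_relevance = 3, ext = "md")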

5. Technical Notes

User-Agent Rotation: The crawler uses a pool of 10 realistic browser User-Agents (Chrome, Firefox, Safari, Edge) randomly selected for each request to reduce 403 errors from anti-scraping systems.

SERP Metadata Preservation: When importing URLs from urlist_Google(), urlist_DuckDuckGo(), or urlist_Bing() via addurl(), the title and publication date from search results are preserved. During crawling, these values are not overwritten, ensuring SERP metadata integrity.

Archive.org Fallback: If a website blocks the crawler (403/503 errors), the system automatically attempts to fetch content from Archive.org's Wayback Machine.
