COMPONENT REPOSITORY (R ANALYSIS BRIDGE). MAIN TOOL AND DATA PIPELINE: https://github.com/MyWebIntelligence/mwi
Status: stable as an analysis bridge (imports/analysis). Not the entry point.
If you are looking for installation and end-to-end usage, start here: https://github.com/MyWebIntelligence/mwi
Context and Objectives
My Web Intelligence (MWI) is a project designed to meet the growing need for tools and methodologies in the field of digital methods in social sciences and information and communication sciences (ICS). The main objective is to map the digital ecosystem to identify key actors, assess their influence, and analyze their discourses and interactions. This project addresses the increasing centrality of digital information and interactions in various fields, including health, politics, culture, and beyond.
Amar LAKEL
Amar Lakel is a researcher in information and communication sciences, specializing in digital methods applied to social studies. He is currently a member of the MICA laboratory (Mediation, Information, Communication, Arts) at the University of Bordeaux Montaigne. His work focuses on the analysis of online discourse, mapping digital ecosystems, and the impact of digital technologies on social and cultural practices.
- MICA Labo: MICA Labo Profile
- Google Scholar: Google Scholar Profile
- ORCID: ORCID Profile
- ResearchGate: ResearchGate Profile
- Academia: Academia Profile
- Twitter: Twitter MyWebIntel Profile
- LinkedIn: LinkedIn Profile
Research Protocol
The research protocol of MWI relies on a combination of quantitative and qualitative methods:
- Data Extraction and Archiving: Using crawl technologies to collect data from the web.
- Data Qualification and Annotation: Applying algorithms to analyze, classify, and annotate the data.
- Data Visualization: Developing dashboards and relational maps to interpret the results.
Methodological Challenges
The MWI project utilizes techniques from the sociology of controversies, social network analysis, and text mining methods to:
- Analyze the strategic positions of speakers in a heterogeneous and complex digital corpus.
- Identify and understand the dynamics of online discourses.
- Map the relationships between different actors and their respective influences.
Diverse Cases
- Health Information
  - Asthma and Diabetes in Children: Studies of online discourses related to these diseases to identify influential actors, understand their positions, and evaluate their impact on patients’ perceptions and behaviors. Source
- Online Political Controversy
  - Juan Branco Project: Analysis of discourses and influence surrounding the public figure Juan Branco, exploring the dynamics of positioning and controversy. Source
- Research Sociology
  - Digital Humanities: Studies on the impact of digital technologies on humanities and social sciences, including how researchers use the web to disseminate and discuss their work. Source
Results and Impact
The results of these studies show that online discourses play a crucial role in shaping opinions and behaviors in various fields. They also highlight the importance for researchers and professionals to actively engage in these discussions to promote reliable and scientifically validated information.
NAKALA Repositories
The data and results of the MWI project are deposited on the NAKALA platform, providing open access for other researchers and practitioners. Here are some important repositories:
- The collection: Contains a detailed description of the project, methodology, and results.
- Positions and Influences on the Web: The Case of Health Information: Detailed analysis of discourses on childhood asthma.
- French Digital Humanities communities: A case study of the development of French digital humanities on the web.
The R package developed within the framework of My Web Intelligence is designed to:
- Facilitate the replication of analyses conducted in the project.
- Enable the extension of developed methods and tools for other research.
- Provide researchers and professionals with a powerful tool to understand and manage the dynamics of online information.
Main Features
- Project Management: Tools to initiate and manage web exploration projects.
- Data Extraction: Functions to crawl the web and extract data corpora.
- Analysis and Annotation: Algorithms to analyze and annotate extracted data.
- Visualization: Dashboards and maps to visualize relationships between actors and discourses.
My Web Intelligence is an integrative project aimed at transforming how we understand and analyze digital information across various fields in social sciences and ICS. By combining innovative methodologies and advanced technological tools, MWI offers new perspectives on digital dynamics and proposes solutions to better understand online interactions and discourses. The R package developed from this project is an essential tool for researchers and practitioners, enabling them to fully exploit web data for in-depth and relevant analyses.
Before installing mwiR, you need to set up your system with R and Python. Follow the instructions for your operating system.
Windows:
- Go to https://cran.r-project.org/bin/windows/base/
- Download the latest R installer (e.g., R-4.4.1-win.exe)
- Run the installer, accept all defaults
- Verify: Open Command Prompt and type R --version
macOS:
- Go to https://cran.r-project.org/bin/macosx/
- Download the .pkg file for your Mac (Intel or Apple Silicon)
- Double-click to install
- Verify: Open Terminal and type R --version
Linux (Ubuntu/Debian):
sudo apt update
sudo apt install r-base r-base-dev
RStudio provides a user-friendly interface for R.
- Go to https://posit.co/download/rstudio-desktop/
- Download the installer for your OS
- Install and launch RStudio
mwiR uses Python's trafilatura library for web content extraction. Python 3.8 or higher is required.
Windows (Important - Read Carefully):
- Go to https://www.python.org/downloads/windows/
- Download "Windows installer (64-bit)" for Python 3.11 or 3.12
- CRITICAL: During installation, check ✅ "Add Python to PATH"
- Click "Install Now"
- Verify installation:
- Open Command Prompt (cmd)
- Type: python --version → Should show Python 3.x.x
- Type: pip --version → Should show pip version
If the python command is not found on Windows:
- The command might be python3 or py instead
- Or reinstall Python with "Add to PATH" checked
macOS:
# Check if Python 3 is installed
python3 --version
# If not installed, use Homebrew:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install python3
# Verify
python3 --version
pip3 --version
Linux (Ubuntu/Debian):
sudo apt update
sudo apt install python3 python3-pip python3-venv
python3 --version
Git is required to install mwiR from GitHub.
Windows:
- Go to https://git-scm.com/download/win
- Download and run the installer
- Accept all defaults (important: keep "Git from command line" selected)
- Verify: Open Command Prompt and type git --version
macOS:
# Git is usually pre-installed. If not:
xcode-select --install
# Or with Homebrew:
brew install git
Linux:
sudo apt install git
Open R or RStudio and run these commands to verify everything is ready:
# Check R version (should be >= 4.0)
R.version.string
# Check if Python is accessible from R
Sys.which("python3") # macOS/Linux
Sys.which("python") # WindowsOnce prerequisites are installed, open RStudio (or R console) and run:
# Step 1: Install the remotes package (if not already installed)
install.packages("remotes")
# Step 2: Install mwiR from GitHub
## Temporarily unset the GitHub PAT for this session
Sys.unsetenv("GITHUB_PAT")
remotes::install_github("MyWebIntelligence/mwiR")
# Step 3: Load the package
library(mwiR)
Alternative installation method (if GitHub access issues):
# Using install_git with full URL
remotes::install_git("https://github.com/MyWebIntelligence/mwiR.git")
Note: The installation may take a few minutes as it downloads and installs all dependencies.
Trafilatura is the Python library used for web content extraction. mwiR handles this automatically - you don't need to install it manually.
When you run initmwi() for the first time:
- mwiR creates a dedicated Python virtual environment (isolated from your system)
- Trafilatura is automatically installed in this environment
- The setup is cached, so subsequent sessions start instantly
# Check current Python/trafilatura status
check_python_status()
# If problems, force reinstall
setup_python(force = TRUE)
# Complete reset (if all else fails)
remove_python_env()
setup_python()
| Problem | Solution |
|---|---|
| "Python not found" | Verify Python is in PATH (see prerequisites) |
| "pip not found" | Reinstall Python with pip included |
| "Permission denied" (Windows) | Run RStudio as Administrator |
| "venv module not found" (Linux) | sudo apt install python3-venv |
| trafilatura errors | Run setup_python(force = TRUE) |
mwiR can integrate with external services that require API keys:
- Create an account at https://serpapi.com/
- Get your API key from the dashboard
- In R, enter the key when prompted or set it manually:
Option 1: Set temporarily in R session
Sys.setenv(SERPAPI_KEY = "your_api_key_here")
This works but you'll need to run it each time you start R.
Option 2: Save permanently in .Renviron file (Recommended)
The .Renviron file stores environment variables that R loads automatically at startup.
How to create/edit .Renviron:
Method A - Using R (easiest):
# This opens the file in your default editor
usethis::edit_r_environ()
# Or create it manually:
file.edit("~/.Renviron")Method B - Manual creation:
| OS | File location |
|---|---|
| Windows | C:\Users\YourUsername\Documents\.Renviron |
| macOS | /Users/YourUsername/.Renviron |
| Linux | /home/YourUsername/.Renviron |
Windows step-by-step:
- Open File Explorer
- Go to C:\Users\YourUsername\Documents\
- Create a new text file
- Rename it to .Renviron (no extension, just .Renviron)
- If Windows complains about the missing extension, confirm "Yes"
- Open with Notepad and add your keys
macOS/Linux step-by-step:
# Open Terminal and run:
nano ~/.Renviron
# Add your keys, then Ctrl+O to save, Ctrl+X to exit
Content of .Renviron file:
SERPAPI_KEY=your_serpapi_key_here
OPENAI_API_KEY=your_openai_key_here
OPENROUTER_API_KEY=your_openrouter_key_here
Important rules for .Renviron:
- One variable per line
- No spaces around =
- No quotes around values
- Add an empty line at the end of the file
- Restart R/RStudio after editing for changes to take effect
Verify your keys are loaded:
# After restarting R:
Sys.getenv("SERPAPI_KEY") # Should show your key
Sys.getenv("OPENAI_API_KEY") # Should show your keyAdd these to your .Renviron file (see instructions above):
OPENAI_API_KEY=sk-your-openai-key
OPENROUTER_API_KEY=sk-or-your-openrouter-key
Note: If you don't have API keys yet, you can skip this step. You can still use mwiR's crawling and analysis features without them.
In this step-by-step guide, we will walk through the initial setup and execution of a research project using the My Web Intelligence (MWI) method. This method allows researchers to analyze topics such as the impact of AI on work by collecting and organizing web data. Here is a breakdown of the R script provided:
initmwi()
The initmwi() function initializes the My Web Intelligence environment
by loading all necessary packages and setting up the environment for
further operations. This function ensures that all dependencies and
configurations are correctly initialized.
db_setup()
The db_setup() function sets up the database needed for storing and
managing the data collected during the research project. It initializes
the necessary database schema and ensures that the database is ready for
data insertion and retrieval.
- db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".
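A minimal session start (a sketch: it simply chains the two calls above, with db_setup() using its documented default database name):
library(mwiR)                  # load the package
initmwi()                      # initialise the MWI environment (first run also sets up Python/trafilatura)
db_setup(db_name = "mwi.db")   # create or open the SQLite database under the default name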
create_land(name = "AIWork", desc = "Impact of AI on work", lang="en")
The create_land() function creates a new research project, referred to
as a “land” in MWI terminology. This land will serve as the container
for all data and analyses related to the project.
- name: A string specifying the name of the land.
- desc: A string providing a description of the land.
- lang: A string specifying the language of the land. Default is "en".
- db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".
addterm("AIWork", "AI, artificial intelligence, work, employment, job, profession, labor market")
The addterm() function adds search terms to the project. These terms
will be used to crawl and collect relevant web data.
- land_name: A string specifying the name of the land.
- terms: A comma-separated string of terms to add.
listlands("AIWork")
The listlands() function lists all lands or projects that have been
created. By specifying the project name “AIWork”, it verifies that the
project has been successfully created.
- land_name: A string specifying the name of the land to list. If NULL, all lands are listed. Default is NULL.
- db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".
addurl("AIWork", urls = "https://www.fr.adp.com/rhinfo/articles/2022/11/la-disparition-de-certains-metiers-est-elle-a-craindre.aspx")
The addurl() function adds URLs to the project. These URLs point to
web pages that contain relevant information for the research.
- land_name: A string specifying the name of the land.
- urls: A comma-separated string of URLs to add. Default is NULL.
- path: A string specifying the path to a file containing URLs. Default is NULL.
- db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".
Alternatively, URLs can be added using a text file:
# If using a text file
addurl("AIWork", path = "_ai_or_artificial_intelligence___work_or_employment_or_job_or_profession_or_labor_market01.txt")
- path: The path to a text file containing the URLs to be added.
listlands("AIWork")
This function is used again to list the projects or a specific project, ensuring that the URLs have been added correctly to “AIWork”.
deleteland(land_name = "AIWork")
The deleteland() function deletes a specified project. This can be
useful for cleaning up after the research is completed or if a project
needs to be restarted.
- land_name: A string specifying the name of the land to delete.
- maxrel: An integer specifying the maximum relevance for expressions to delete (see the sketch below). Default is NULL.
- db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".
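A hedged illustration of the two clean-up modes; the maxrel behaviour shown here (removing only expressions at or below that relevance score rather than the whole land) is an assumption based on the parameter description above:
# Delete the whole land and its data
deleteland(land_name = "AIWork")
# Assumed usage: keep the land, but drop expressions with relevance <= 1
deleteland(land_name = "AIWork", maxrel = 1)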
This script demonstrates the basic setup and execution of a research project using My Web Intelligence, including project creation, term addition, URL management, and project verification.
In this section, we will walk through the process of crawling URLs and extracting content for analysis using the My Web Intelligence (MWI) method. The following R code snippets demonstrate how to perform these tasks.
crawlurls("AIWork", limit = 10)
The crawlurls() function crawls URLs for a specific land, updates the
database, and calculates relevance scores.
- land_name: A character string representing the name of the land.
- urlmax: An integer specifying the maximum number of URLs to be processed (default is 50).
- limit: An optional integer specifying the limit on the number of URLs to crawl.
- http_status: An optional character string specifying the HTTP status to filter URLs.
- db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".
Example:
This example demonstrates crawling up to 10 URLs for the land named “AIWork”.
crawlurls("AIWork", limit = 10)
crawlDomain(1000)
The crawlDomain() function crawls domains and updates the Domain table
with the fetched data.
- nburl: An integer specifying the number of URLs to be crawled (default is 100).
- db_name: A string specifying the name of the SQLite database file. Default is "mwi.db".
Example:
This example demonstrates crawling 1000 URLs and updating the Domain table.
crawlDomain(1000)
In this section, we will walk through the process of exporting data and corpora from a research project using the My Web Intelligence (MWI) method. The following R code snippets demonstrate how to perform these tasks.
The export_land() function exports your research data in various formats.
Parameters:
- land_name: Name of the land/project
- export_type: Format to export (see below)
- minimum_relevance: Minimum relevance score (default: 1)
- labase: Database file (default: "mwi.db")
- ext: For corpus export only - "md", "txt", or "pdf" (default: "md")
Available export types:
| Type | Description | Output |
|---|---|---|
| pagecsv | Basic page data | CSV file |
| fullpagecsv | Complete page data with content | CSV file |
| nodecsv | Domain nodes for network analysis | CSV file |
| nodegexf | Domain network graph | GEXF file (Gephi) |
| pagegexf | Page-level network graph | GEXF file (Gephi) |
| mediacsv | Media/images extracted | CSV file |
| corpus | Text corpus with metadata | ZIP of md/txt/pdf files |
Examples:
# Export page data as CSV (minimum relevance = 3)
export_land("AIWork", "pagecsv", 3)
# Export full content as CSV
export_land("AIWork", "fullpagecsv", 2)
# Export domain network for Gephi
export_land("AIWork", "nodegexf", 3)
# Export page network for Gephi
export_land("AIWork", "pagegexf", 3)
# Export text corpus as Markdown files
export_land("AIWork", "corpus", 3, ext = "md")
# Export text corpus as PDF files
export_land("AIWork", "corpus", 3, ext = "pdf")Once the foundational land is in place, the next objective is to broaden your web perimeter. The package provides dedicated helpers around SerpAPI so you can script keyword expansion and SERP harvesting before every crawl.
related_query("intelligence artificielle", lang = "fr", country = "France")related_query() returns the "People also search for" block as a tidy data frame. Typical workflow: collect the suggestions, inspect them quickly in R, fold the most relevant ones back into addterm(), and archive the CSV for methodological transparency.
Common language codes (lang parameter):
| Code | Language | Code | Language |
|---|---|---|---|
| en | English | it | Italian |
| fr | French | pt | Portuguese |
| de | German | pt-br | Portuguese (Brazil) |
| es | Spanish | nl | Dutch |
| es-419 | Spanish (Latin America) | pl | Polish |
| ar | Arabic | ru | Russian |
| zh-cn | Chinese (Simplified) | ja | Japanese |
| zh-tw | Chinese (Traditional) | ko | Korean |
Common country values (country parameter):
| Country | Country | Country |
|---|---|---|
| France | Germany | Spain |
| United States | United Kingdom | Canada |
| Belgium | Switzerland | Italy |
| Brazil | Mexico | Argentina |
| Japan | Australia | Netherlands |
Usage examples:
# French search in France
related_query("intelligence artificielle", lang = "fr", country = "France")
# English search in United States
related_query("artificial intelligence", lang = "en", country = "United States")
# Spanish search in Spain
related_query("inteligencia artificial", lang = "es", country = "Spain")
# German search in Germany
related_query("künstliche Intelligenz", lang = "de", country = "Germany")
# Portuguese search in Brazil
related_query("inteligência artificial", lang = "pt-br", country = "Brazil")Full list available at: SerpAPI Languages and SerpAPI Countries
urlist_Google(
query = "ai OR artificial intelligence",
datestart = "2024-01-01",
dateend = "2024-03-31",
timestep = "month",
sleep_seconds = 2,
lang = "en"
)
urlist_Google(), urlist_Duck(), and urlist_Bing() paginate SERP responses and write raw URL dumps to disk (one file per query). You can then read those files back with importFile() and feed them to addurl(), as sketched below. Remember to space requests (sleep_seconds) to stay inside rate limits.
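Putting the round trip together (a sketch; importFile() is used as in the database utilities section below, where it returns a data frame with a url column):
# 1. Harvest SERP results; one raw URL dump per query is written to disk
urlist_Google(
  query = "ai OR artificial intelligence",
  datestart = "2024-01-01",
  dateend = "2024-03-31",
  timestep = "month",
  sleep_seconds = 2,
  lang = "en"
)
# 2. Read the dump back as a data frame
serp_urls <- importFile()
# 3. Feed the harvested URLs into the land
addurl("AIWork", urls = serp_urls$url)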
mwir_seorank(
filename = "aiwatch_seo",
urls = c("example.com", "opencorpus.org"),
api_key = Sys.getenv("SEORANK_API_KEY")
)
mwir_seorank() queries the SEO Rank API for MOZ/PageSpeed-style indicators. Because the function appends rows as soon as a response arrives, you can launch it overnight on dozens of domains and obtain a ready-to-share CSV.
When the time comes to model or discretise quantitative indicators (e.g., in-degree, frequency, sentiment scores), the package offers statistical helpers inspired by social-science workflows.
plotlog(
df = analytics,
variables = c("in_degree", "reach"),
trans_type = c(in_degree = "log1p", reach = "zscore"),
save = TRUE
)
plotlog() overlays the original and transformed histograms so you can compare scales immediately. Main arguments and expected inputs:
- df: data frame passed to the function. If you leave variables = NULL, every numeric column in df is analysed.
- variables: character vector that specifies the columns to plot. You can supply a named vector or list so each variable receives its own transformation rule.
- trans_type: transformation applied to each series. Recognised keywords: "none", "log", "log1p", "sqrt", "rank", "zscore". Provide a single value to reuse it everywhere, a named vector/list to mix them, or a custom function returning a numeric vector.
- bins: histogram resolution. Accepts an integer (e.g. 30) or one of the standard rules: "sturges" (default), "fd"/"freedman-diaconis", "scott", "sqrt", "rice", "doane", "auto" (maximum of Sturges and F-D).
- colors, alpha: choose the two fill colours (original vs transformed) and set the transparency level between 0 and 1.
- theme: any ggplot2 theme object (theme_minimal() by default).
- density, show_rug: booleans that toggle a kernel density overlay and a rug showing individual points.
- na_rm, min_non_missing: control filtering. na_rm = TRUE drops non-finite values before plotting; min_non_missing (default 5) is the minimum number of finite values required for a variable to be plotted.
- shift_constant: positive offset automatically added before log/sqrt transformations when the data contains zero or negative values (default 1).
- display: TRUE prints the combined panel to the current graphics device; set to FALSE to return the object silently.
- save: set to TRUE to export the plots. Use with save_dir (folder), save_format ("png", "pdf", or "jpg"), save_dpi, device_width, and device_height to control the files that are written.
- verbose: produces progress messages when TRUE (defaults to interactive()).
scaled <- transform_variable(
x = analytics$reach,
method = "yeojohnson",
winsorize = 0.01
)
transform_variable() stores both the transformed values and the inverse mapping. This makes it easy to export model-ready columns while keeping de-standardisation metadata.
- x: numeric vector to transform (NA/Inf allowed; non-finite entries propagate).
- method: transformation choice: "auto" (bestNormalize search), "none", "center", "zscore", "robust_z", "log", "log1p", "sqrt", "boxcox", "yeojohnson", "ranknorm", or a user-supplied function.
- winsorize: optional share of tails to trim before transforming (0 ≤ value < 0.5). Use NULL to skip.
- shift_constant: positive constant automatically added before log/sqrt transforms when x contains non-positive values (default 1).
- handle_na: choose "keep" (default) to leave NA in place or "omit" to drop them before fitting the transform.
- ...: forwarded to bestNormalize helpers (e.g. Box-Cox tweaks) when the selected method requires it.
clusters <- find_clusters(
analytics$reach,
max_G = 5,
transform = "auto",
winsorize = 0.01,
return_breaks = TRUE
)
classes <- discretize_variable(
analytics$reach,
method = "manual",
breaks = clusters$breaks,
labels = c("Faible", "Moyen", "Élevé")
- find_clusters() fits one-dimensional Gaussian mixtures to reveal typologies. Key parameters: max_G (number of components), criterion ("bic" or "icl"), transform ("none", "log1p", "yeojohnson", "zscore", "auto") and winsorize (0–0.5). With return_breaks = TRUE, the function supplies ready-to-use breakpoints and exposes the classification, the posterior probabilities, n_clusters, and several diagnostics.
- discretize_variable() then turns the continuous measure into interpretable classes. The available methods ("equal_freq", "equal_width", "quantile", "jenks", "kmeans", "gmm", "manual") cover most scenarios. In "manual" mode, supply your own breaks (those from the clustering, for example) and meaningful labels. The returned factor stays ordered and keeps a discretize_meta attribute (breakpoints, counts, warnings).
powerlaw <- analyse_powerlaw(
analytics$reach,
type = "discrete",
candidate_models = c("powerlaw", "lognormal", "exponential"),
bootstrap_sims = 200,
winsorize = 0.01,
threads = NULL # Auto-detection (recommended)
)
analyse_powerlaw() compares several tail distributions to test whether a genuine power law is present.
- In "discrete" mode, the data are rounded and values < 1 are excluded; check that enough positive observations remain (min_n = 50 by default).
- Adjust type, candidate_models, winsorize, xmin and threads to balance robustness against computation time.
- Automatic parallelisation: threads = NULL (the default) automatically detects the optimal number of cores:
  - On Apple Silicon (M1/M2/M3), only the Performance cores (P-cores) are used, avoiding the slower E-cores
  - On other systems, half of the available cores minus 1 are used
  - Use mwir_system_info() to see the detected configuration
- candidate_models accepts "powerlaw", "lognormal", "exponential" (and "weibull" for continuous data). You can supply a targeted subset or reorder the models according to the distributions relevant to your field.
- bootstrap_sims controls the number of KS simulations and bootstrap_models restricts the list of simulated models. Lower bootstrap_sims for a quick result, raise it for more precision.
- The output gathers best_model, the parameters (best_fit), the likelihood comparisons (comparisons), the bootstrap diagnostics (bootstrap) and a data_summary that can go straight into reports (see the sketch below).
- Good practice: try several winsorize values, monitor best_fit$n_tail, examine the bootstrap p-values, and justify the chosen xmin.
# Display system info and recommended worker count
mwir_system_info()
# Example output on Apple Silicon Mac:
# === mwiR System Info ===
# OS: Darwin
# Architecture: arm64
# Total cores: 10
# Performance cores: 8
# Efficiency cores: 2
# Recommended workers: 7
# Apple Silicon: Yes
mwir_system_info() displays hardware detection results and the recommended number of parallel workers. Useful for:
- Verifying Apple Silicon detection
- Understanding why threads = NULL chose a specific worker count
- Troubleshooting parallelization issues
Large Language Models can speed up qualitative coding, but they demand guardrails. The unified LLM_Recode() function supports multiple providers (OpenAI, OpenRouter, Anthropic, Ollama) with a simple, consistent interface.
# Configure your preferred provider
LLM_Config(provider = "openai", lang = "fr")
# Or set API keys directly
Sys.setenv(OPENAI_API_KEY = "sk-...")
Sys.setenv(OPENROUTER_API_KEY = "orpk-...")
Sys.setenv(ANTHROPIC_API_KEY = "sk-ant-...")
# Simple translation of a vector
translations <- LLM_Recode(
data = c("Hello world", "Automation and labour markets"),
prompt = "Traduire en français: {value}",
temperature = 0.4
)
# Process multiple columns using glue templates
results <- LLM_Recode(
data = my_dataframe,
prompt = "Résumer en 20 mots: {title} - {content}",
provider = "openrouter",
model = "openrouter/auto",
temperature = 0.2,
max_tokens = 80,
return_metadata = TRUE,
parallel = TRUE, # Enable parallel processing
workers = 4
)
# Check failures
failed <- results[results$status != "ok", ]
LLM_Recode() inputs
- data: vector or data.frame to process (required).
- prompt: glue template with {variable} placeholders (required). For vectors, use {value}.
- provider: "openai", "openrouter", "anthropic", or "ollama" (auto-detected if omitted).
- model: model identifier (default depends on the provider).
- temperature (0–2): generation randomness.
- max_tokens: cap on output tokens.
- max_retries, retry_delay, backoff_multiplier: retry strategy with exponential backoff.
- rate_limit_delay: delay between requests to respect quotas.
- parallel, workers: parallel processing via future/furrr.
- return_metadata: returns a data.frame with status, attempts, and tokens used.
- on_error: "continue" (default) or "stop" on the first error.
Output
- By default, a vector of recoded strings (NA on failure).
- With return_metadata = TRUE, a data.frame with the columns value, status, attempts, tokens_used, model_used, error_message.
LLM_Config() - Session configuration
LLM_Config(
provider = "openai", # Provider par défaut
model = "gpt-4o", # Modèle par défaut
lang = "fr", # Langue des messages (fr/en)
verbose = TRUE # Afficher les messages de progression
)Bonnes pratiques
- Document prompt, sysprompt, model and version in your lab notebook.
- Lower temperature for faithful translations, raise it for creative rephrasings.
- Limit max_tokens to keep responses concise.
- Use return_metadata = TRUE to audit the results.
- Enable parallel = TRUE sparingly (respect the APIs' rate limits).
The database layer underpins every land. The following helpers keep it healthy and synchronised with external edits.
con <- connect_db()
land_id <- get_land_id(con, "AIWork")
domaines <- list_domain(con, land_name = "AIWork")
- connect_db() returns a ready-to-use DBI connection.
- get_land_id() converts human-readable land names into numeric IDs when you automate workflows.
- list_domain() produces a domain summary (counts, keywords) to monitor coverage.
urls <- importFile()
addurl("AIWork", urls = urls$url)Use importFile() whenever you enrich your corpus from spreadsheets or open postings. The helper returns a data frame; pass the relevant column to addurl().
annotatedData(
dataplus = curated_notes,
table = "Expression",
champ = "description",
by = "id"
)
annotatedData() wraps transactional updates so a batch edit either fully succeeds or rolls back. Always back up mwi.db before bulk reinsertion.
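A minimal backup step before a bulk update (base R only; the backup file name is illustrative):
# Copy the SQLite database before running annotatedData() on a large batch
file.copy("mwi.db", "mwi_backup.db", overwrite = FALSE)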
Beyond export_land(), the family of dedicated exporters gives you fine-grained control:
- export_pagecsv() and export_fullpagecsv() to share tabular corpora.
- export_nodecsv() / export_nodegexf() for network analysis.
- export_mediacsv() to audit associated media.
- export_pagegexf() for expression-level graphs.
- export_corpus() to assemble text files plus metadata headers (ideal for CAQDAS tools). Supports multiple formats: ext = "md" (Markdown, default), ext = "txt" (plain text), ext = "pdf" (PDF, requires rmarkdown + pandoc).
Each exporter accepts minimum_relevance, so you can balance breadth and focus depending on the audience.
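A hedged illustration, assuming the dedicated exporters mirror export_land()'s land_name and minimum_relevance arguments (their exact signatures may differ):
# Tabular corpus for colleagues (broad: relevance >= 1)
export_pagecsv("AIWork", minimum_relevance = 1)
# Domain graph for Gephi (focused: relevance >= 3)
export_nodegexf("AIWork", minimum_relevance = 3)
# Markdown corpus for CAQDAS import
export_corpus("AIWork", minimum_relevance = 2, ext = "md")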
User-Agent Rotation: The crawler uses a pool of 10 realistic browser User-Agents (Chrome, Firefox, Safari, Edge) randomly selected for each request to reduce 403 errors from anti-scraping systems.
SERP Metadata Preservation: When importing URLs from urlist_Google(), urlist_DuckDuckGo(), or urlist_Bing() via addurl(), the title and publication date from search results are preserved. During crawling, these values are not overwritten, ensuring SERP metadata integrity.
Archive.org Fallback: If a website blocks the crawler (403/503 errors), the system automatically attempts to fetch content from Archive.org's Wayback Machine.
