diff --git a/docs/index.md b/docs/index.md index f1969be..8e40e72 100644 --- a/docs/index.md +++ b/docs/index.md @@ -8,14 +8,7 @@ Any broad questions then please do reach out in our community space [here](https Further in development projects are [here](https://github.com/orgs/CogStack/repositories) -![](./overview/attachments/43c14755-e565-4ae0-a0a3-ec6dc18a691c.png) - -| Tool | Description | -|:-----|:------------| -| ![CogStack-Nifi](overview/attachments/36c0d23f-a632-4fbf-9f7c-6669e88bbd39.png){width=100}
[**CogStack-Nifi**](https://cogstack-nifi.readthedocs.io/en/latest/main.html) | Data flow orchestration using Apache NiFi | -| ![MedCAT](overview/attachments/09a8bb60-9864-41fa-be7b-cf9a9dc04498.png){width=100}
[**MedCAT**](https://medcat.readthedocs.io/en/latest/) | Medical Concept Annotation Toolkit | -| ![MedCATTrainer](overview/attachments/09a8bb60-9864-41fa-be7b-cf9a9dc04498.png){width=100}
[**MedCATTrainer**](https://medcattrainer.readthedocs.io/en/latest/) | Web-based annotation and training interface for MedCAT | - +![](./overview/attachments/architecture.png){width=2000} ```{toctree} :hidden: diff --git a/docs/overview/CogStack ecosystem (v1).md b/docs/overview/CogStack ecosystem (v1).md deleted file mode 100644 index 2913be5..0000000 --- a/docs/overview/CogStack ecosystem (v1).md +++ /dev/null @@ -1,152 +0,0 @@ -# CogStack ecosystem (v1) - -In this part are covered the available services that can be running in an example CogStack deployment. To such deployment with many running services we refer as an  *ecosystem* or a *platform*. Below is presented a high-level perspective of CogStack platform with the possibilities it enables through many components and services. In practice, many of the functionalities that CogStack platform enables are implemented as separate, but interconnected services working inside the ecosystem. - -## Core services - -In most scenarios CogStack platform will consist of *core* services tailored to specific use-cases. Additional application and services can be run on top of it, such as [SemEHR](../../CogStack%20General/CogStack%20Wiki/CogStack%20projects/SemEHR.md), [Patient Timeline](../../CogStack%20General/CogStack%20Wiki/CogStack%20projects/Patient%20Timeline.md), Live Alerting (through ElasticSearch plugins) or any other custom developed applications. For an ease-of-use, when deploying a sample CogStack platform, we always emphasise to use Docker Compose (see: [Running CogStack](Running%20CogStack.md)). - -Below is presented is one of the most simple and common scenarios when ingesting and processing the EHR data from a proprietary data source. 
- -![](./attachments/5503ea0e-ac74-40ba-936a-1287ad3f1cf5.png) - -A CogStack platform presented here consists of such core services: - -- *CogStack Pipeline* service for ingesting and processing the EHR data from the source database, -- *CogStack Job Repository* (PostgreSQL database) serving for job status control, -- *ElasticSearch* sink where the processed EHR records are stored, -- (optional) *Kibana* user interface to easily perform exploratory data analysis over the processed records. - -It is essential to note that presented is a very simplified scenario, which can be easily deployed even on a local machine with limited resources. We are also using here an optional Kibana as an out-of-the-box and easy to use solution to explore the data, although many other data analysis or BI tools can be used. Moreover, there are also available connectors to ElasticSearch in many languages, such as Java, Python, R or JavaScript allowing for fast development of custom user applications. - -:::{tip} -Note - -In the picture we only presented ElasticSearch using a single node. However, in practice, one should consider using at least 3 asticSearch nodes deployed as a cluster which greatly improves resilience, query performance and reliability. -Similarly, in the picture we only presented one CogStack Pipeline instance and only one data source. However, in practice, there may be multiple sources available with multiple Pipeline components running in parallel. This is why, when considering deploying CogStack platform in production, one should keep in mind the aspects of the scalability and resilience of the platform and running services. -::: - - -### CogStack Pipeline - -CogStack Pipeline is the main data processing service used inside the CogStack platform. Within the ecosystem it's main responsibilities is to ingest the EHR data from a specified data source, process the data (e.g. 
by applying the text extraction methods, records de-identification or extracting the NLP annotations) and store the resulting data in the specified sink. - -Usually, the sink will be the ElasticSearch store, keeping the processed EHRs which can be ready to use by other applications. However, when performing computationally-expensive processing tasks, such as running OCR-based text extraction from the documents, one may prefer to store the partial results in a cache. In such case, PostgreSQL can be used as a temporary store – [Examples](Examples.md) covers such case. - -The information about available data processing components offered by CogStack Pipeline can be found in [CogStack Pipeline](CogStack%20Pipeline.md) part. - -:::{info} -We recommend using CogStack Pipeline component in the newest version 1.3.0. -::: - ---- - ---- - - - -### PostgreSQL - -[PostgreSQL](https://www.postgresql.org/) is a widely used object-relational database management system. In CogStack platform it is primarily used as a job repository, for storing the jobs execution status of running CogStack Pipeline instances. However, there may be cases where one may need to store the partial results treating PostgreSQL DB either as a data cache (see: [Examples](Examples.md) ) or an auxiliary data sink. - -When used as a job repository, it requires defining appropriate tables with a user that will be used by CogStack Pipeline running instance(s). This schema is defined by [Spring Batch META-DATA schema definition](https://docs.spring.io/spring-batch/trunk/reference/html/metaDataSchema.html) and is also available in `CogStack-Pipeline/examples/docker-common/pgjobrepo/create_repo.sh` script. - -:::{Info} -We recommend using PostgreSQL in versions >= 10. -In the [Examples](Examples.md) part we use PostgreSQL in version 11.1. -::: - -:::{warning} -Note - -PostgreSQL by default has a connection limit of 100.  
Since a single CogStack Pipeline instance using multiple processing threads uses a connection pool both for retrieving the EHR data from the database source and to update the job repository, one may need to increase the default connection limit with the available memory buffers. To do so, one may specify parameters: `"-c 'shared_buffers=256MB' -c 'max_connections=1000'"` when initialising the database. -::: - -### ElasticSearch - -[ElasticSearch](https://www.elastic.co/guide/) is a popular NoSQL search engine based on the Lucene library that provides a distributed full-text search engine storing the data as schema-free JSON documents. Inside CogStack platform it is usually used as a primary data store for processed EHR data by CogStack Pipeline. - -Depending on the use-case, the processed EHR data is usually stored in indices as defined in corresponding CogStack Pipeline job description property files (see: [CogStack Pipeline](CogStack%20Pipeline.md)). Once stored, it can be easily queried either by using the own's REST API (see: [ElasticSearch Search API](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html)), queried using [Kibana](#kibana) or queried using a ElasticSearch connector available in many programming languages. ElasticSearch apart from standard functionality and features provided in its open-source free version also offers more advanced ones distributed as [Elastic Stack](https://www.elastic.co/products/stack) (formerly: X-Pack extension) which require license. These include modules for machine learning, alerting, monitoring, security and more. - -:::{tip} -In our [Examples](Examples.md) we use the free, open-source version of ElasticSearch without the Elastic Stack modules included. 
It needs to be noted that in cases when one requires a secure and/or granular access to the processed EHR data in ElasticSearch sink, one should explore the [Security](https://www.elastic.co/guide/en/x-pack/current/elasticsearch-security.html) module (formerly: Shield) offered in the Elastic Stack. Some of the features include (as stated the official website): -- Preventing unauthorised access with password protection, role-based access control (even per index- or single document-level), and IP filtering. -- Preserving the integrity of your data with message authentication and SSL/TLS encryption. -- Maintaining an audit trail so one know who’s doing what to your cluster and the data it stores. -CogStack Pipeline fully supports the functionality provided by the ElasticSearch Security module used to securely access the node(s). -::: - -:::{Info} -In our [Examples](Examples.md) we use a simple, single-node ElasticSearch deployment. However, in practice, one should consider using at least 3 ElasticSearch nodes deployed as a cluster which greatly improves resilience, query performance and reliability. -::: - -:::{important} -We recommend using ElasticSearch in versions >= 6.0. 
-::: - - -:::{warning} -Note - -If ElasticSearch service does not start up and such error is reported: - -> elasticsearch    | ERROR: [1] bootstrap checks failed -> elasticsearch    | [1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536] - -one may need to increase the number of available file descriptors on the **host** machine – please refer to:  -::: - -:::{warning} -Note - -If ElasticSearch service does not start up and such error is reported: - -> elasticsearch    | ERROR: [1] bootstrap checks failed -> elasticsearch    | [1]: max virtual memory areas vm.max\_map\_count [65530] is too low, increase to at least [262144] - -one may need to increase the number of available virtual memory on the **host** machine – please refer to:  -::: - ---- - ---- - -### Kibana - -[Kibana](https://www.elastic.co/products/kibana) is a data visualisation module for ElasticSeach that be easily used to explore and query the data. In sample CogStack platform deployments it can be used as a ready-to-use data exploration tool. - -Apart from providing exploratory data analysis functionality it also offers administrative options over the ElasticSearch data store, such as adding/removing/updating the documents using command line or creating/removing indices. Moreover, custom user dashboards can be created according to use-case requirements. For a more detailed description of the available functionality please refer to the [official documentation](https://www.elastic.co/guide/en/kibana/current/introduction.html). - -:::{info} -In all our [Examples](Examples.md) we provide ElasticSearch bundled with Kibana. -::: - ---- - ---- - -### NGINX - -NGINX is a popular, open-source web server that can also be used as a reverse proxy, load balancer, HTTP cache and more. In CogStack platform deployments, it can be used as a reverse-proxy and providing a basic security access to the exposed data stores and service endpoints. 
Some of the functionality may include general user-based authentication, IP filtering and selective service access. A more detailed description of security features offered by NGINX can be found in the [official documentation](https://docs.nginx.com/nginx/admin-guide/security-controls/). - -[Examples](Examples.md) covers a simple use-case with NGINX serving as a basic authentication module. The example configuration of NGINX running as a proxy can be found in `CogStack-Pipeline/examples/docker-common/nginx/config/` directory. - -:::{info} -It needs to be noted, however, that the security and granularity of access to the data stored in ElasticSearch offered by NGINX is inferior to using the [Security](https://www.elastic.co/guide/en/x-pack/current/elasticsearch-security.html) module from Elastic Stack. -::: - ---- - ---- - -### Fluentd - -[Fluentd](https://www.fluentd.org/) is an open source data collector providing a unified logging layer. In sample CogStack platform deployments it can be used running as a service collecting the logs from all the running services which can be used for auditing. - -Fluentd provides a highly configurable and flexible set of rules, filters and plugins that can be used to set the logging for any running service inside the platform. The [official Fluentd documentation](https://docs.fluentd.org/v1.0/articles/quickstart) covers many Fluentd examples with detailed description. - -[Examples](Examples.md) covers a simple use-case with using Fluentd for logging. The example configuration file can be found in `CogStack-Pipeline/examples/docker-common/fluentd/conf/` directory. - ---- - ---- diff --git a/docs/overview/Data pipelines.md b/docs/overview/Data pipelines.md index 541b071..79f3d5e 100644 --- a/docs/overview/Data pipelines.md +++ b/docs/overview/Data pipelines.md @@ -1,44 +1,16 @@ - - - # Data pipelines ## Introduction This page covers the data pipelines used in CogStack ecosystem. 
-:::{warning} -Please note that CogStack-Pipeline was the initial implementation of CogStack platform and this pipeline engine is being deprecated – we are moving forward with porting the existing pipeline functionality using Apache NiFi as the main data processing engine (see below: **CogStack-NiFi**). -::: - -## CogStack-Pipeline - -### Overview - -CogStack-Pipeline is an application for executing data pipelines for performing EHR data ingestion from databases to ElasticSearch (primarily) or other databases. It implements a fixed set of ETL operations including extraction of text from binary documents using Apache Tika, running NLP applications based on [GATE NLP suite](https://gate.ac.uk/) and a custom de-identification application based on text scrubbing. It was build in Spring Batch and implements only a document-oriented data processing model. For a complete description on CogStack-Pipeline please refer to [the official documentation](https://cogstack.atlassian.net/wiki/spaces/COGDOC). - -:::{IMPORTANT} -The latest version of CogStack Pipeline is 1.3.1. -::: - -### Key resources - -- Documentation: [https://cogstack.atlassian.net/wiki/spaces/COGDOC](/wiki/spaces/COGDOC) -- Deployment examples: [Examples](Examples.md) -- GitHub: -- DockerHub: - ## CogStack-NiFi ### Overview +CogStack-NiFi is the re-architected version of CogStack-Pipeline that replaces the fixed Spring Batch-based pipeline engine with [Apache NiFi](https://nifi.apache.org/). It focuses on fully configurable and scalable data flows, with a data processing engine that is easy to use, deploy and tailor to any site-specific data flow requirements. Apache NiFi also comes with built-in monitoring, data provenance and security features that give operators better control and improve reliability.
**CogStack-NiFi useful links:**

**Apache NiFi resources:**
  • The official website: https://nifi.apache.org/

  • The official documentation: https://nifi.apache.org/docs.html -| | | -|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------| -| CogStack-NiFi is the 
re-architected version of CogStack-Pipeline that replaces the fixed Spring Batch-based pipeline engine with [Apache NiFi](https://nifi.apache.org/). It focuses on fully configurable and scalable data flows with the data processing engine that is easy to use, deploy and tailor to any site-specific data flow requirements. Apache NiFi also comes in with build-in monitoring, data provenance and security features that puts the operations in better control and reliability.
    **CogStack-NiFi useful links:**


    **Apache NiFi resources:**

    | ![](./attachments/b5fc6b57-faf2-4747-9e77-eb9adf51d8b3.jpg) | +![](./attachments/b5fc6b57-faf2-4747-9e77-eb9adf51d8b3.jpg) -:::{IMPORTANT} -Please note that CogStack-NiFi project is still under active development with the newest version **0.1.0**. -::: ### Apache NiFi – overview @@ -74,11 +46,8 @@ Each ingestion job that is being run by CogStack-Pipeline also requires a separa Moreover, one of the main limitations of CogStack pipeline has been support only for a document-centric data model for performing ingestion where each ingested record could only contain one document to be processed. Apache NiFi does not enforce document-centric data model and provides flexibility on defining custom data flows and data schemas. Handling multiple documents in a single record or using a patient-centric data model is a matter of tailoring the pipeline and defining or tailoring appropriate schema. -Moreover, fixed ETL operations (implemented as modules in CogStack-Pipeline) can be included as custom ETL scripts or application modules inside a defined Apache NiFi data flow. For example, the text extraction done by [Apache Tika](https://tika.apache.org/) and NLP functionality (such as running [MedCAT](https://github.com/CogStack/MedCATservice) or [GATE NLP](https://github.com/CogStack/gate-nlp-service) applications was implemented as external micro-services exposing that expose a REST API and hence can be used directly in the data flow. All the third-party application dependencies are handled by the external services that further allows for separating the responsibilities. +Moreover, fixed ETL operations (implemented as modules in CogStack-Pipeline) can be included as custom ETL scripts or application modules inside a defined Apache NiFi data flow. For example, NLP functionality, such as running [MedCAT](https://github.com/CogStack/MedCATservice), was implemented as external micro-services that expose a REST API and hence can be used directly in the data flow. 
All the third-party application dependencies are handled by the external services that further allows for separating the responsibilities. -:::{IMPORTANT} -Please note that the recommended minimal resources requirements for running Apache NiFi will be higher than for CogStack-Pipeline and these will depend on the actual use-case. -::: ### Example deployment and services diff --git a/docs/overview/Elasticsearch.md b/docs/overview/Elasticsearch.md index 113665b..58a3750 100644 --- a/docs/overview/Elasticsearch.md +++ b/docs/overview/Elasticsearch.md @@ -1,7 +1,7 @@ -# Elasticsearch +# Elasticsearch / OpenSearch ## Introduction diff --git a/docs/overview/Natural Language Processing.md b/docs/overview/Natural Language Processing.md index 21af7f8..794e35b 100644 --- a/docs/overview/Natural Language Processing.md +++ b/docs/overview/Natural Language Processing.md @@ -64,52 +64,6 @@ Key resources: Please note that there is available public MedCAT model trained on MedMentions corpus that can be used to play with. ::: -## GATE NLP applications - -### Overview of GATE NLP suite - -[GATE NLP suite](https://gate.ac.uk/) is a well established and rich set of open-source technologies implementing full-lifecycle solution for text processing. The GATE ecosystem is very broad and outside of the scope of this documentation – here we will only focus on two applications: - -- [GATE Developer](https://gate.ac.uk/family/developer.html), -- [GATE Embedded](https://gate.ac.uk/family/embedded.html). - -GATE Developer is a development environment that provides a large set of graphical interactive tools for the creation, measurement and maintenance of software components for natural language processing. It allows to design, create and run NLP applications using an intuitive user interface. These applications can be later exported as a custom *gapp* or *xgapp* application with the used resources. 
- -GATE Embedded, on the other hand, is an object-oriented framework (or class library) implemented in Java. It is used in all GATE-based systems, and forms the core (non-visual) elements of GATE Developer. In principle, it implements the runtime for executing GATE applications. It allows to run the *gapp* and *xgapp* applications that have been previously created in GATE Developer. - - -:::{IMPORTANT} -When deploying GATE applications within CogStack one may be interested in defining and tailoring custom GATE applications directly by using GATE Developer. Such prepared application can be in the next step provided into CogStack **GATE NLP Runner Service** that uses GATE Embedded to execute GATE applications. This way, provided NLP application can be deployed as a service and used in the data pipeline. -::: - -Although there have been developed and published many applications in GATE NLP suite, in this page we only briefly cover Bio-YODIE. - -### Bio-YODIE - -Bio-YODIE is a named entity linking system derived from GATE YODIE system. It links mentions in biomedical text to their referents in the UMLS. It defines a broad set of types such as `Disease` , `Drug`, `Observation` and many more all of the types belonging to `Bio` group – for detailed information please refer to [the official documentation](https://gate.ac.uk/applications/bio-yodie.html). - -Bio-YODIE can be run either within GATE Developer application or as a service within CogStack (based on GATE Embedded and running as a Service). Here we primarily focus on the latter and refer the reader to the official Bio-YODIE website. - -**Key resources:** - -- The official website: -- GitHub repository with application code: -- GitHub repository with code to prepare UMLS resources for Bio-YODIE: - -:::{WARNING} -Please note that Bio-YODIE requires resources to be prepared using UMLS. These are limited by individual license and cannot be openly shared. 
-::: - -### GATE NLP Runner service - -CogStack implements a GATE NLP Runner service that serves the GATE NLP applications as a service exposing RESTful API. It is using GATE Embedded to execute the GATE applications that are provided either in *gapp* or *xgapp* format. The API specification is provided in the sections below. - -For more information please refer to the official GitHub with code and documentation: - -## NLP REST API - -CogStack defines a simple, uniform, RESTful API for free-text documents processing. It’s primary focus has been on providing an application independent and uniform interface for extracting entities from the free-text. The data exchange should be stateless and synchronous. The use-case is: given a document (or a corpus of documents) extract the recognised named entities with associated meta-data. This way, any NLP application can be used or any NLP model can be served in the data processing pipeline as long as it stays compatible with the interface. - ### REST API definition The API defines 3 endpoints, that consume and return data in JSON format: @@ -118,8 +72,6 @@ The API defines 3 endpoints, that consume and return data in JSON format: - *POST* `/api/process` - processes the provided single document and returns back the annotations, - *POST* `/api/process_bulk` - processes the provided list of documents and returns back the annotations. -The full definition is available as [OpenAPI or Swagger](https://github.com/CogStack/gate-nlp-service/tree/devel/api-specs) specification. - #### GET `/api/info` Returns information about the used NLP application. The returned fields are: @@ -166,7 +118,7 @@ Here, the `content` object holds an array of single document content to be proce ### Example use :::{tip} -Please see [CogStack using Apache NiFi Deployment Examples](https://github.com/CogStack/CogStack-NiFi/tree/devel/deploy) to see how to deploy example NLP services, i.e. 
MedCAT with a public MedMentions model and example GATE NLP Drug application. +Please see [CogStack using Apache NiFi Deployment Examples](https://github.com/CogStack/CogStack-NiFi/tree/devel/deploy) to see how to deploy example NLP services, i.e. MedCAT with a public MedMentions model. ::: #### MedCAT @@ -205,117 +157,4 @@ and the received result: "timestamp": "2019-12-03T16:09:58.196+00:00" } } -``` - -### Bio-YODIE - -Bio-YODIE is being run as a service using CogStack GATE NLP Runner Service as described above. In this example Bio-YODIE application will only output annotations of `Disease` type from `Bio` group (defined in the service configuration file). Assuming that the service is running on the `localhost` with the API exposed on port `8095`, so one can run: - -```bash -curl --header "Content-Type: application/json" \ - --request POST \ - --data '{"content":{"text": "lung cancer diagnosis"}}' \ - http://localhost:8095/api/process -``` - -and the received result: - -```json -{ - "result": { - "text": "lung cancer diagnosis", - "annotations": [ - { - "end_idx": 11, - "set": "Bio", - "Negation": "Affirmed", - "Experiencer": "Patient", - "PREF": "Lung Cancer", - "end_node_id": "17", - "TUI": "T191", - "language": "", - "start_node_id": "16", - "type": "Disease", - "LABELVOCABS": "CHV,MEDLINEPLUS,MSH", - "CUIVOCABS": "MTH,CHV,MSH,SNOMEDCT_US,NCI,LCH_NW,OMIM,MEDLINEPLUS,COSTAR,NCI_CTRP-SDC", - "inst_full": "http://linkedlifedata.com/resource/umls/id/C0242379", - "inst": "C0242379", - "string_orig": "lung cancer", - "STY": "Neoplastic Process", - "start_idx": 0, - "id": 18, - "text": "lung cancer", - "Temporality": "Recent", - "tui_full": "http://linkedlifedata.com/resource/semanticnetwork/id/T191" - } - ], - "metadata": { - "document_features": { - "keyOverlapsOnly": false, - "gate.SourceURL": "created from String", - "docType": "generic", - "deleteNonNNPLookups": "true", - "lang": "en" - } - }, - "success": true, - "timestamp": "2019-12-03T16:10:13.281+00:00" - } 
-} -``` - -### Extra: a simple GATE-based drug names extraction application - -As an extra example, a simple application for extracting drug names from the free-text was developed in GATE Developer using ANNIE Gazetteer. It uses as an input the data downloaded from [Drugs@FDA database](https://www.accessdata.fda.gov/scripts/cder/daf/) and further refined giving a curated list of drugs and active ingredients. The application functionality is exposed using CogStack GATE NLP Runner Service. - -Similarly as in above, assuming that the application is running on the `localhost` with the API exposed on port `8095`, one can run: - -```bash -curl -XPOST http://localhost:8095/api/process \ - -H 'Content-Type: application/json' \ - -d '{"content":{"text":"The patient was prescribed with Aspirin."}}' - -``` - -and the received result: - -```json -{ - "result": { - "text": "The patient was prescribed with Aspirin.", - "annotations": [ - { - "end_idx": 39, - "majorType": "Drug", - "set": "", - "name": "ASPIRIN", - "start_idx": 32, - "language": "", - "id": 12, - "minorType": "ActiveComponent", - "text": "Aspirin", - "type": "Drug" - }, - { - "end_idx": 39, - "majorType": "Drug", - "set": "", - "name": "ASPIRIN", - "start_idx": 32, - "language": "", - "id": 13, - "minorType": "Medication", - "text": "Aspirin", - "type": "Drug" - } - ], - "metadata": { - "document_features": { - "gate.SourceURL": "created from String" - } - }, - "success": true, - "timestamp": "2019-12-04T09:51:32.246Z" - } -} -``` +``` \ No newline at end of file diff --git a/docs/overview/_index.md b/docs/overview/_index.md index fe014a8..48d7bcc 100644 --- a/docs/overview/_index.md +++ b/docs/overview/_index.md @@ -3,7 +3,6 @@ ```{toctree} :maxdepth: 1 cogstack-documentation -CogStack ecosystem (v1) Data pipelines Elasticsearch Natural Language Processing diff --git a/docs/overview/attachments/36c0d23f-a632-4fbf-9f7c-6669e88bbd39.png b/docs/overview/attachments/36c0d23f-a632-4fbf-9f7c-6669e88bbd39.png deleted 
file mode 100644 index f3f46cc..0000000 Binary files a/docs/overview/attachments/36c0d23f-a632-4fbf-9f7c-6669e88bbd39.png and /dev/null differ diff --git a/docs/overview/attachments/43c14755-e565-4ae0-a0a3-ec6dc18a691c.png b/docs/overview/attachments/43c14755-e565-4ae0-a0a3-ec6dc18a691c.png deleted file mode 100644 index 0e573ff..0000000 Binary files a/docs/overview/attachments/43c14755-e565-4ae0-a0a3-ec6dc18a691c.png and /dev/null differ diff --git a/docs/overview/attachments/54bb85e8-0428-4a56-a702-fd359272ed6e.png b/docs/overview/attachments/54bb85e8-0428-4a56-a702-fd359272ed6e.png deleted file mode 100644 index 0e573ff..0000000 Binary files a/docs/overview/attachments/54bb85e8-0428-4a56-a702-fd359272ed6e.png and /dev/null differ diff --git a/docs/overview/attachments/5503ea0e-ac74-40ba-936a-1287ad3f1cf5.png b/docs/overview/attachments/5503ea0e-ac74-40ba-936a-1287ad3f1cf5.png deleted file mode 100644 index f4df7d8..0000000 Binary files a/docs/overview/attachments/5503ea0e-ac74-40ba-936a-1287ad3f1cf5.png and /dev/null differ diff --git a/docs/overview/attachments/architecture.png b/docs/overview/attachments/architecture.png new file mode 100644 index 0000000..fd8123b Binary files /dev/null and b/docs/overview/attachments/architecture.png differ diff --git a/docs/overview/cogstack-documentation.md b/docs/overview/cogstack-documentation.md index f28e402..f5539b8 100644 --- a/docs/overview/cogstack-documentation.md +++ b/docs/overview/cogstack-documentation.md @@ -1,22 +1,21 @@ - - - # CogStack Documentation ## What is CogStack? -CogStack is a lightweight distributed, fault tolerant database processing architecture and ecosystem, intended to make NLP processing and preprocessing easier in resource constrained environments. It comprises of multiple components, and has been designed to provide configurable data processing pipelines for working with EHR data. 
For the moment it mainly uses databases and files as the primary source of EHR data with the possibility of adding custom data connectors in the near future. It makes use of the [Apache-Nifi](https://nifi.apache.org/) framework in order to provide a fully configurable data processing pipeline with the goal of generating annotated JSON standardised schema files that can be readily indexed into [ElasticSearch](https://www.elastic.co/), stored as files or pushed back to a database. +CogStack is a lightweight, distributed, fault-tolerant database processing architecture and ecosystem, intended to make NLP processing and preprocessing easier in resource-constrained environments. It comprises multiple components, and has been designed to provide configurable data processing pipelines for working with EHR data. + +CogStack uses databases and files as primary sources of EHR data, with support for custom data connectors. The platform leverages [Apache NiFi](https://nifi.apache.org/) to provide fully configurable data processing pipelines with the goal of generating annotated JSON standardised schema files that can be readily indexed into [ElasticSearch](https://www.elastic.co/), stored as files or pushed back to a database. -![](./attachments/54bb85e8-0428-4a56-a702-fd359272ed6e.png) +![](./attachments/architecture.png) -The CogStack ecosystem has been developed as an open source project with the code available on GitHub: [https://github.com/CogStack/](https://github.com/CogStack/CogStack-Pipeline) . +CogStack is a commercial open-source product, with the code available on GitHub: [https://github.com/CogStack/](https://github.com/CogStack/) . For enterprise deployments, full platform setup, and advanced features, please [contact us](https://docs.cogstack.org/en/latest/). :::{tip} -Starting from version 1.2 CogStack is preferably being run as an ecosystem using a set of different microservices and deployed using [Docker Compose](https://docs.docker.com/compose/). 
The ready-to-use CogStack images are available to pull directly from the official Docker Hub under [cogstacksystems](https://hub.docker.com/u/cogstacksystems/) organisation. We’re actively pursuing running the stack in a K8s cluster also. +CogStack is designed as a microservices-based ecosystem. The recommended deployment method is on **Kubernetes using Helm charts**, which provides cloud-native support, scalability, and reliability. Ready-to-use CogStack images are available from the official Docker Hub under the [cogstacksystems](https://hub.docker.com/u/cogstacksystems/) organisation. Docker Compose is still supported for development and smaller deployments, but Kubernetes is recommended for production environments. ::: ## Why does this project exist? -The CogStack consists of a range of technologies designed to to support modern, open source healthcare analytics within the NHS, and is chiefly comprised of the Elastic stack ([ElasticSearch](https://www.elastic.co/products/elasticsearch), [Kibana](https://www.elastic.co/products/kibana), etc.), [MedCAT](https://github.com/CogStack/MedCAT) (clinical natural language processing for named entity extraction and linking), clinical text [OCR](https://github.com/CogStack/ocr-service), clinical text de-identification. Since the processed EHR data can be represented and stored in databases or ElasticSearch, CogStack can be perfectly utilised as one of the solutions for integrating EHR data with other types of biomedical, -omics, wearables data, etc. 
+CogStack consists of a range of technologies designed to support modern, open source healthcare analytics, and is chiefly comprised of the Elastic stack ([ElasticSearch](https://www.elastic.co/products/elasticsearch), [Kibana](https://www.elastic.co/products/kibana), etc.), [MedCAT](https://github.com/CogStack/MedCAT) (clinical natural language processing for named entity extraction and linking, contextualization, and relation extraction), clinical text [OCR](https://github.com/CogStack/ocr-service), and clinical text de-identification. Since the processed EHR data can be represented and stored in databases or ElasticSearch, CogStack can be perfectly utilised as one of the solutions for integrating EHR data with other types of biomedical, -omics, wearables data, etc. --- \ No newline at end of file