Exclude images from download #13

@o-sapov

Is this repository an appropriate place to ask questions about this client: https://pypi.org/project/transkribus-client?

How can images be excluded from the download?
I found that there should be an option bNoImage (cf. https://github.com/Transkribus/TranskribusPyClient/blob/master/src/TranskribusPyClient/client.py).

But it is not clear to me where it should be passed in. Here is how the download was implemented by my predecessor:

import logging
import os
import shutil
from pathlib import Path
from typing import List

from classes.conf import ConverterConfig
from classes.logger import Logger
from tqdm import tqdm
from transkribus import TranskribusAPI
from transkribus.models import Collection, Document

CONF = ConverterConfig()
LOG = Logger().get_logger()


class TranskribusDownloader:
    """
    TranskribusDownloader is a wrapper for the unofficial transkribus-client
    (https://gitlab.com/arkindex/transkribus/-/blob/master/transkribus/api.py)
    """

    api: TranskribusAPI

    def downloadDocuments(
        self,
        colId: int,
        docIds: List[int],
        downloadDir: str,
        usePreviousDownload: bool
    ) -> List[Path]:
        """Downloads documents from Transkribus and returns
        a list of Paths of the directories containing the downloaded files.
        Attention! Unless usePreviousDownload is set, every subdirectory
        of downloadDir is removed first.

        Args:
            colId (int): id of transkribus collection
            docIds (List[int]): ids of transkribus documents
            downloadDir (str): parent directory for downloads
            usePreviousDownload (bool): if True, nothing is deleted or
                downloaded; only the expected directories are returned
        Returns:
            List of Paths of the directories containing the downloaded files
        """
        directories = []
        downloadDir = Path(downloadDir)
        if not usePreviousDownload:
            for subdirectory in [f.path for f in os.scandir(downloadDir) if f.is_dir()]:
                shutil.rmtree(subdirectory)

        for docId in tqdm(docIds, desc='Downloading documents'):
            if not usePreviousDownload:
                collection: Collection = Collection(int(colId))
                doc: Document = Document(
                    collection,
                    int(docId)
                )
                LOG.info(
                    f'Downloading document {docId} from collection {colId}')
                doc.download(self.api, downloadDir)
            directories.append(Path(f'{downloadDir}/{docId}'))
        return directories

    def __init__(self):
        try:
            self.api = TranskribusAPI()
            self.api.login(CONF.transkribusUser(), CONF.transkribusPassword())
        except Exception:
            # Login can fail for reasons other than missing credentials
            LOG.error('Could not log in to Transkribus. Check the configured credentials.')
