About how to download raw PDF files using 2012_manifest.tsv #111

@kevindryz

I couldn't follow the specific download process, and my company does not allow direct downloads from cloud storage. However, the dc_slug value in the TSV file is enough to work out the original PDF URL.

For example, for a dc_slug like 456300-sept-17-23-2012-11953-13474707086771-_-pdf, splitting at the first hyphen gives the document ID (456300) and the rest of the slug (sept-17-23-2012-11953-13474707086771-_-pdf). The URL can then be constructed as follows:

doc_id, name = dc_slug.split('-', 1)  # split only at the first hyphen
url = f'https://s3.amazonaws.com/s3.documentcloud.org/documents/{doc_id}/{name}.pdf'

This way, I can directly access the PDF!
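
To apply the same idea to every row of the manifest, something like the sketch below might work. It assumes the TSV has a header row with a column literally named dc_slug, and that every slug follows the `<numeric id>-<rest>` pattern shown above; neither has been checked against the whole file. The helper names slug_to_url and urls_from_manifest are made up for illustration, and the code only builds URLs without downloading anything.

```python
import csv
import io

# Base path taken from the example URL above.
BASE_URL = "https://s3.amazonaws.com/s3.documentcloud.org/documents"


def slug_to_url(dc_slug: str) -> str:
    """Build a PDF URL from a dc_slug (hypothetical helper).

    Assumes the slug is '<numeric id>-<rest of slug>'.
    """
    # partition() splits only at the first hyphen.
    doc_id, sep, name = dc_slug.partition("-")
    if not sep or not doc_id.isdigit() or not name:
        raise ValueError(f"unexpected dc_slug format: {dc_slug!r}")
    return f"{BASE_URL}/{doc_id}/{name}.pdf"


def urls_from_manifest(lines, column="dc_slug"):
    """Yield one URL per manifest row (hypothetical helper).

    `lines` is any iterable of text lines, e.g. an open file.
    The column name 'dc_slug' is an assumption about the header.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        slug = (row.get(column) or "").strip()
        if slug:  # skip rows with an empty slug
            yield slug_to_url(slug)


# Small in-memory stand-in for the manifest, using the slug from above.
sample = io.StringIO(
    "dc_slug\n456300-sept-17-23-2012-11953-13474707086771-_-pdf\n"
)
for url in urls_from_manifest(sample):
    print(url)
```

For the real file, the in-memory sample would be replaced by something like `open("2012_manifest.tsv", newline="", encoding="utf-8")`. Rows whose slug does not match the assumed pattern raise a ValueError rather than silently producing a bad URL.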
