SETI-like crowdsourcing #6

@morgajel

Description

This ended up a lot longer than I intended, but it started with the question, "How can we crowdsource this?"

Concept

Let's take a page out of SETI@home's playbook from the '90s: break the data into easier-to-manage chunks and allow users to contribute their findings in an organized manner that balances security and trustworthiness with ease of use.

Concerns

There is a LOT of data, and human verification of everything is not feasible; however, we can use automation and bots to help verify data.

We'll need it, because contamination from bad actors is a major concern. Reports need to trace back to individual accounts so that if one or more accounts are attempting to act in bad faith, their content can be re-validated and/or removed easily. Every step should be re-verifiable, and should be validated before it's accepted.

Disclaimer

I'm just an IT guy with a strong sense of justice and a quickly disintegrating brain. I don't know if this will be useful, but I wanted to get my ideas out to someone who might be able to make it happen.

Technologies

Data Storage

I think SQLite might be a good choice. We'll need to use some type of database that can be stored in git without requiring a server to run it. User interactions should be read-only, and any PRs with changes to the database directly should be auto-rejected. I'm not sure if SQLite is the best choice (or even feasible), so consider it a placeholder technology.
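As a minimal sketch of the read-only interaction (assuming a hypothetical `master_epfile_index.sqlite` file and the columns listed later in this issue), SQLite's URI `mode=ro` makes a connection reject writes:

```python
import sqlite3

DB_PATH = "master_epfile_index.sqlite"  # hypothetical filename

# One-time setup (done centrally, not by end users): create the table.
with sqlite3.connect(DB_PATH) as conn:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS Master_EpFile_Index (
               MIR_ID TEXT PRIMARY KEY,
               File_URL TEXT,
               File_Name TEXT,
               File_Size INTEGER,
               File_Checksum TEXT,
               is_validated INTEGER DEFAULT 0,
               Review_count INTEGER DEFAULT 0
           )"""
    )

# End users open the database read-only via a SQLite URI; writes fail.
ro = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)
try:
    ro.execute("INSERT INTO Master_EpFile_Index (MIR_ID) VALUES ('x')")
    writable = True
except sqlite3.OperationalError:
    writable = False
print(writable)  # False: the read-only connection rejects the INSERT
```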

Bots

Any changes made to the database should be centralized and not done via PRs: accept reviews and summaries from PRs, but apply database changes through a centralized, automated process based on those reports. That makes merge conflicts much less frightening. It can be GitHub Actions, Travis CI, whatever; it just needs to be automated and defensive.
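As one sketch of the "defensive" part, a bot check could refuse any PR that touches a database file directly. The paths and patterns here are hypothetical, assuming the SQLite indexes live under an `indexes/` directory:

```python
from fnmatch import fnmatch

# Paths a PR may never touch directly; hypothetical repo layout where
# all SQLite databases live under indexes/.
PROTECTED = ["indexes/*.sqlite", "*.sqlite"]

def pr_is_allowed(changed_files):
    """Reject any PR that modifies a database file directly; only the
    centralized bot may rewrite the SQLite indexes."""
    return not any(
        fnmatch(path, pattern)
        for path in changed_files
        for pattern in PROTECTED
    )

print(pr_is_allowed(["reviews/12/abc/def.txt"]))  # True: reviews are fine
print(pr_is_allowed(["indexes/sub_012.sqlite"]))  # False: direct DB edit
```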

AI

I'll leave that to you to determine the acceptability of how AI will be leveraged (if at all).

Terminology

To keep things easier to understand, I'm pre-defining some terms for later use:

  • Master_EpFile_Index: Single SQLite DB containing all 13,000,000+ records
  • Sub_EpFile_Index: A subsection of the master list of files (one of ~260) containing 50,000 records

Ways to Ensure Trustworthiness

  • Signed commits and checksums at each step, with each step dependent on the previous being valid (like a blockchain)
  • Example: subindex validation. Create a checksum from the Master_EpFile_Index record and compare it to MIR_Checksum. Do the same for the subindex record values. Ensure all three checksums match.
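The subindex validation above might look like the following sketch. The field order and separator used to derive MIR_Checksum are my assumptions, not a spec, and the sample values are made up:

```python
import hashlib

def record_checksum(mir_id, file_url, file_name, file_size, file_checksum):
    """Derive a checksum from the five shared fields, joined in a fixed
    order so master and sub-index records hash identically.
    (Field order and '|' separator are assumptions.)"""
    payload = "|".join([mir_id, file_url, file_name, str(file_size), file_checksum])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Made-up master record; a sub-index record copies these fields verbatim.
master = ("uuid-1234", "https://example.org/doc1.pdf", "doc1.pdf", 1024, "abc123")
sub = master

mir_checksum = record_checksum(*master)  # stored as MIR_Checksum
print(record_checksum(*sub) == mir_checksum)  # True: the copy is intact
```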

Step 1: Initial Repo Setup

A good deal of work needs to be done ahead of time: prepping the repository to handle what needs to be done and getting the automation and verification in place. The work done in the beginning will be the foundation for what comes next.

  1. Identify the Files. Arguably the hardest part: gather a list of all 13 million files and their metadata (URLs, filenames, sizes, checksums). This is a herculean task in and of itself.

  2. Create Master Index of all files. This would be a SQLite DB with a single table containing the following:

    • Master_EpFile_Index
      • MIR_ID: UUID Master Index Record identifier
      • File_URL: Original Source URL
      • File_Name: human/machine friendly name of file once downloaded (removing/replacing any bad characters)
      • File_Size: in bytes
      • File_Checksum: either SHA256 or SHA512 to reduce chance of intentional collisions.
      • is_validated: (boolean) URL has been tested, and checksum and filesize match.
      • Review_count: Number of reviews (maintained by bot).
        This would need to be updated/revalidated weekly to add new files AND to ensure files don't go missing or get changed at the source, flagging any that do. Any changes would need to be investigated before the master index is updated. In addition, we may also include Approved Metadata:
    • EpFile_Entities
      • Entity_ID: UUID for Lookup table of businesses, organizations, and people.
      • Legal_Name: usually the formal version of the Common Name, e.g. William Jefferson Clinton, Amazon.com Inc, The Federalist Society.
      • Common_Name: Common name used in informal conversations, e.g. Bill Clinton, Amazon, The Federalists.
      • First_Name: First name (if applicable), usually informal, e.g. Bill.
      • Middle_Name: Middle name (if applicable)
      • Last_Name: Last name (if applicable)
    • EpFile_Nicknames
      • Entity_ID_FK: Reference to EpFile_Entities:Entity_ID
      • Nickname: Common alias for a person, e.g. Bubba
    • EpFile_Location
      • Location_ID: UUID of a specific location
      • Location_Name: Actual Location referenced
      • Parent_Location_ID_FK: reference to parent location (if applicable) (e.g. Tallahassee might have Florida as a parent)
      • MIR_ID_FK: reference to Master_EpFile_Index:MIR_ID
  3. Split it up. Many hands make quick work. Since 13,000,000+ files is too much to work with, create a second list of sub-indexes broken into manageable chunks of 50,000 entries, or ~260 SQLite DBs. The subset delineation is to help divide and conquer. It would be based on file ID rather than content, allowing weaker systems to contribute. These sub-index SQLite files would contain the following:

  • Sub_EpFile_Index
    • Subset_Index_Number: Numeric identifier of the index subset (should be between 1 and ~260)
    • SIR_ID: UUID for this Subset Index Record
    • MIR_Checksum: Checksum created from the matching fields in the correlating Master_EpFile_Index:MIR_ID
      • MIR_ID
      • File_URL
      • File_Name
      • File_Size
      • File_Checksum
    • MIR_ID: value of UUID in the Master_EpFile_Index (for validation purposes)
    • File_URL: (for validation purposes)
    • File_Name: (for validation purposes)
    • File_Size: (for validation purposes)
    • File_Checksum: (for validation purposes)
    • File_Type: file mimeType
    • Review Count: (calculated) number of reviews.
  • Sub_EpFile_Review
    • Review_ID: UUID for this Review.
    • Sub_Index_Number: Numeric Identifier of the index subset (should be between 1 and ~260) (for validation purposes)
    • SIR_ID_FK: foreign key of Sub_EpFile_Index:SIR_ID
    • GitHub_Username: Person responsible for creating this review
    • Review_Checksum: Checksum of the review
    • Review_Output_File: Location of the review, likely something like reviews/{{Sub_Index_ID}}/{{Sub_Index_Record_ID}}/{{Review_ID}}
    • Review_Summary_File: Location of the summary, likely something like summary/{{Sub_Index_ID}}/{{Sub_Index_Record_ID}}/{{Review_ID}}
    • Report_Signature: Signature from User asserting review and summary authenticity (for validation purposes)
    • Report_Type: Would be one of a list of options: OCR conversion, Facial Recognition, Audio Transcript, Video Description, etc
    • Script_CommitID: Commit ID of software creating the review (to ensure reproducibility).
    • Sub_EpFile_Review_Entity_Lookup
      • Review_ID_FK
      • SIR_ID_FK
      • Entity_ID_FK
      • Mentions: Number of mentions of an Entity's Legal Name, Common Name, First Name, and any Nicknames.
    • Sub_EpFile_Timestamp
      • Timestamp_ID: UUID of a specific time
      • Datestamp: Actual date referenced (if applicable)
      • Timestamp: Actual Timestamp referenced (if applicable, variable precision)
      • Review_ID_FK: reference to review where timestamp was mentioned.
      • SIR_ID_FK: reference to Sub_EpFile_Index:SIR_ID

This should give us a pretty good base to start organizing data.
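As one possible starting point, the sub-index tables described above could be created like this. Table and column names come from the text; the SQL types and constraints are my assumptions:

```python
import sqlite3

# Create one sub-index database (use a file path like sub_001.sqlite in
# practice; :memory: keeps this sketch self-contained).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sub_EpFile_Index (
    Subset_Index_Number INTEGER,          -- 1 to ~260
    SIR_ID              TEXT PRIMARY KEY, -- UUID for this record
    MIR_Checksum        TEXT,             -- derived from master fields
    MIR_ID              TEXT,             -- copied for validation
    File_URL            TEXT,
    File_Name           TEXT,
    File_Size           INTEGER,
    File_Checksum       TEXT,
    File_Type           TEXT,             -- mimeType
    Review_Count        INTEGER DEFAULT 0
);
CREATE TABLE Sub_EpFile_Review (
    Review_ID           TEXT PRIMARY KEY,
    Sub_Index_Number    INTEGER,
    SIR_ID_FK           TEXT REFERENCES Sub_EpFile_Index(SIR_ID),
    GitHub_Username     TEXT,
    Review_Checksum     TEXT,
    Review_Output_File  TEXT,
    Review_Summary_File TEXT,
    Report_Signature    TEXT,
    Report_Type         TEXT,             -- OCR, transcript, etc.
    Script_CommitID     TEXT              -- for reproducibility
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['Sub_EpFile_Index', 'Sub_EpFile_Review']
```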

Step 2: End User Participation

  1. Download this repository.
  2. Perform initial setup steps establishing identity and credibility, GPG signatures, etc.
  3. Run script (parameters can be determined later). This is where a bulk of the work is done:
    1. Select an index at random.
    2. Select a batch of files at random.
    3. Ask users for approval.
    4. If approved, perform the following for each file:
      1. Download file.
      2. Validate file is authentic
      3. Perform required conversion step to create initial review (e.g. OCR to text) found at Review_Output_File.
      4. Generate Summary of Review_Output_File including any mentions of names, places, or entities and save it to Review_Summary_File.
      5. Sign Review and append to Summary.
      6. Sign Summary
    5. Display summaries
    6. Prompt user to submit findings.
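The "validate file is authentic" step above could be sketched as a size-plus-checksum check against the sub-index record. `validate_file` is a hypothetical helper, and the demo file stands in for a real download:

```python
import hashlib
import os

def validate_file(path, expected_size, expected_sha256):
    """Confirm a downloaded file matches the size and SHA-256 checksum
    recorded in the sub-index before reviewing it."""
    if os.path.getsize(path) != expected_size:
        return False
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large files don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256

# Demo with a locally created stand-in for a downloaded file.
with open("demo.bin", "wb") as f:
    f.write(b"hello")
expected = hashlib.sha256(b"hello").hexdigest()
print(validate_file("demo.bin", 5, expected))  # True
```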

Step 3: Data Integration

  1. Results submitted via a PR.
  2. Bot Validates the following:
    • GitHub user is in "good standing"
    • all checksums from MIR_ID to Summary match expected results.
  3. If everything is above board, merge data from summary into the Approved Metadata tables.
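The merge step might look something like this sketch. The dedupe-by-Legal_Name rule is my assumption for illustration; the real merge policy (and the subsidiaries question in the TODO) would need more thought:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")  # stand-in for the master database
conn.execute("""CREATE TABLE EpFile_Entities (
    Entity_ID   TEXT PRIMARY KEY,
    Legal_Name  TEXT UNIQUE,
    Common_Name TEXT)""")

def merge_entity(conn, legal_name, common_name):
    """Insert an entity from an approved summary, reusing the existing
    row if the legal name is already known (hypothetical merge rule)."""
    row = conn.execute(
        "SELECT Entity_ID FROM EpFile_Entities WHERE Legal_Name = ?",
        (legal_name,)).fetchone()
    if row:
        return row[0]
    entity_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO EpFile_Entities VALUES (?, ?, ?)",
        (entity_id, legal_name, common_name))
    return entity_id

a = merge_entity(conn, "William Jefferson Clinton", "Bill Clinton")
b = merge_entity(conn, "William Jefferson Clinton", "Bill Clinton")
print(a == b)  # True: duplicate summaries collapse to one entity row
```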

TODO

  • How do we handle subsidiaries of orgs or businesses?
  • How do we tie individuals to entities?
  • How do we classify "hangouts" or groups of friends?

Feel free to take this idea, reshare, implement, etc.
