scottdbrown edited this page Dec 13, 2013 · 8 revisions

Database

Purpose

This project includes a database implementation for the storage and analysis of epigenetics data. The goal is to produce an open-source database implementation and provide an open repository for the collection of public epigenetic data in a format that is relevant and useful. The database is not intended to be a repository for raw data; it will hold processed, useful data sets.

Introduction

Conventions

To keep things consistent and searchable, several conventions must be followed for the key/value pairs that are entered into the database.

  1. Keys must be entered in all lower case letters.
  2. Chromosome names must be entered in all lower case letters and prefixed with "chr".
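
The two conventions above can be applied automatically before a document is stored. The following is a minimal sketch, not part of the project code: the function names are hypothetical, and it assumes the chromosome name is stored under a `chr` key, as in the waves Collection. It applies both rules exactly as stated.

```python
def normalize_key(key: str) -> str:
    """Rule 1: keys are stored in all lower case letters."""
    return key.strip().lower()


def normalize_chromosome(name: str) -> str:
    """Rule 2: chromosome names are lower case and prefixed with 'chr'."""
    name = name.strip().lower()
    if not name.startswith("chr"):
        name = "chr" + name
    return name


def normalize_document(doc: dict) -> dict:
    """Return a copy of doc with normalized keys and a normalized 'chr' value.

    Assumption: the chromosome name lives under the key 'chr'.
    """
    clean = {normalize_key(key): value for key, value in doc.items()}
    if "chr" in clean:
        clean["chr"] = normalize_chromosome(str(clean["chr"]))
    return clean
```

Running every document through a helper like `normalize_document` before insertion keeps lookups on keys and chromosome names case-consistent.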

Schema

The schema for this database will be relatively fluid, as it will be built in a MongoDB environment, with the expectation that sharding will be applied and the database scaled to run over a cluster of machines. However, neither sharding nor a cluster will be a requirement; both are simply implementation details.

For data not yet released to the public, we may use a "hidden" flag to keep the data from being visible. This requires a web interface/API that respects the flag and enforces appropriate permissions.
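
One way a web interface/API could respect such a flag is to add a filter to every query against the samples Collection. The sketch below is an illustration only: the function names are hypothetical, and it assumes the flag is the boolean `hide` field listed in the samples Collection. The filter dictionary uses MongoDB's `$ne` query operator, and the second function shows the same logic applied to a plain Python list.

```python
def visibility_filter(include_hidden: bool = False) -> dict:
    """Build a MongoDB query filter for the samples collection.

    Assumption: hidden samples carry {"hide": True}.
    """
    if include_hidden:
        # Privileged users see everything, so no restriction is added.
        return {}
    # $ne also matches documents that have no "hide" field at all.
    return {"hide": {"$ne": True}}


def visible_samples(samples, include_hidden: bool = False) -> list:
    """In-memory equivalent of visibility_filter, for lists of dicts."""
    if include_hidden:
        return list(samples)
    return [s for s in samples if s.get("hide") is not True]
```

Whether a caller may pass `include_hidden=True` would be decided by the permission layer, which is outside the scope of this sketch.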

The current schema for the yeast epigenetic ChIP-chip data is as follows:

waves Collection:

  • _id - automatically generated when a wave is added to the database, this attribute is an ObjectId.
  • pos - the chromosomal position of the wave
  • height - the height of the wave
  • chr - the chromosome name using roman numerals (ex. chrVII)
  • stddev - the standard deviation of the wave's gaussian distribution
  • sample_id - a reference to the _id of an entry in the samples Collection
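
To make the fields above concrete, the following sketch builds a waves document and evaluates the Gaussian it describes. It rests on assumptions not stated in this page: that `height` is the peak value of the Gaussian at `pos`, and that `stddev` is in the same units as `pos`. The function names are hypothetical, and `_id` is left out so that MongoDB can generate it.

```python
import math


def make_wave(sample_ref, chromosome, pos, height, stddev):
    """Build a waves document; sample_ref is the _id of a samples entry."""
    if stddev <= 0:
        raise ValueError("stddev must be positive")
    return {
        "sample_id": sample_ref,
        "chr": chromosome,
        "pos": pos,
        "height": height,
        "stddev": stddev,
    }


def wave_signal(wave, x):
    """Value of the wave at position x.

    Assumption: the wave is height * exp(-(x - pos)^2 / (2 * stddev^2)),
    i.e. 'height' is the peak value reached at 'pos'.
    """
    z = (x - wave["pos"]) / wave["stddev"]
    return wave["height"] * math.exp(-0.5 * z * z)
```

In a real database `sample_ref` would be the ObjectId of a samples document; any placeholder value works for trying the functions out.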

samples Collection:

  • _id - automatically generated when a sample is added to the database, this attribute is an ObjectId.
  • antibody - this is the antibody used in the ChIP-chip experiment
  • antibody_volume - the amount of antibody used
  • array_lot_number - the lot number of the array used
  • array_type - the type of array (ex. GeneChip S. cerevisiae Tiling 1.0R Array)
  • catalog_number - catalog number of the array
  • comments - any comments regarding the sample
  • crosslinking_time - length of time used for crosslinking
  • exp_date - date of the experiment
  • file_name - name of the file that was used to import the waves
  • haswaves - true if the sample has associated waves in the waves Collection
  • hide - true if the sample should be hidden from the web browser interface
  • input_file - path and name of the .wig file used to generate waves
  • make_wig - true if WaveGenerator created the .wig file, false otherwise
  • min_height - value of min_height used in the WaveGenerator
  • mutations - mutation state of the sample
  • number_waves - WaveGenerator setting that controls whether a number is displayed next to each wave
  • output_path - directory where WaveGenerator saved the .waves file
  • processor_threads - number of processors used by the WaveGenerator
  • protocol - which protocol was used in the experiment (ex. T7)
  • pubmed_id - the PubMed ID associated with the experiment
  • researcher - name of the researcher who carried out the experiment
  • sample_id - composite name composed of type, antibody, mutation, and date. Note: sample_id must be unique, and must not contain commas (,) or long dashes (–). Short dashes (-) are OK.
  • strain_background - the background yeast strain
  • strain_number - the number of the yeast strain
  • type - type of experiment (IP, input, mock)
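
The rules for `sample_id` can be checked before a sample is stored. The sketch below is a hypothetical helper, not the project's import code. The underscore separator and the field order are assumptions, since this page does not say how the four parts are joined. The em dash is rejected along with the en dash on the assumption that "long dashes" covers both.

```python
# Characters that must not appear in a sample_id:
# comma, en dash (U+2013), and (by assumption) em dash (U+2014).
FORBIDDEN_CHARS = (",", "\u2013", "\u2014")


def validate_sample_id(sample_id, existing_ids=()):
    """Raise ValueError if sample_id breaks the rules; return it otherwise."""
    for char in FORBIDDEN_CHARS:
        if char in sample_id:
            raise ValueError(f"sample_id contains forbidden character {char!r}")
    if sample_id in existing_ids:
        raise ValueError(f"sample_id {sample_id!r} is not unique")
    return sample_id


def make_sample_id(exp_type, antibody, mutation, exp_date, existing_ids=()):
    """Join type, antibody, mutation, and date into a sample_id.

    Assumption: parts are joined with an underscore, in this order.
    """
    sample_id = "_".join([exp_type, antibody, mutation, exp_date])
    return validate_sample_id(sample_id, existing_ids)
```

For example, `make_sample_id("IP", "H3", "wt", "2013-12-13")` uses illustrative values only; short dashes in the date are accepted, as the note on `sample_id` allows.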

Required Collections:

  1. Experiment type (e.g. ChIP-seq, methylation array, etc.)
  2. Experiment conditions
  3. Patient data
  4. Sample data
  5. bins - for map/reduce?
  6. Probe set information - for array data

TODO:

  • build indices on the database
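
As a starting point for this item, the sketch below lists indices that would likely help the queries implied by the schema: waves looked up by sample and by chromosomal position, and a unique index to enforce the uniqueness rule on `sample_id` in the samples Collection. The choice of indices is an assumption, not a decided design. `ensure_indexes` expects a PyMongo `Database` object (or anything with the same `db[name].create_index(keys, **options)` interface).

```python
# Direction 1 means ascending, as in pymongo.ASCENDING.
INDEX_SPECS = {
    "waves": [
        # Find all waves that belong to one sample.
        ([("sample_id", 1)], {}),
        # Find waves in a region of a chromosome.
        ([("chr", 1), ("pos", 1)], {}),
    ],
    "samples": [
        # Enforce the rule that sample_id must be unique.
        ([("sample_id", 1)], {"unique": True}),
    ],
}


def ensure_indexes(db):
    """Create every index in INDEX_SPECS; return (collection, keys) pairs."""
    created = []
    for collection_name, specs in INDEX_SPECS.items():
        for keys, options in specs:
            db[collection_name].create_index(keys, **options)
            created.append((collection_name, keys))
    return created
```

With PyMongo installed, this could be called as `ensure_indexes(MongoClient()["some_db"])`, where the database name is a placeholder. Creating an index that already exists with the same options is a no-op, so the function can be run repeatedly.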
