Database
This project includes a database implementation for the storage and analysis of epigenetics data. The goal is to produce an open-source database implementation and provide an open repository for the collection of public epigenetic data in a format that is relevant and useful. The database is not intended to be a repository for raw data; it will hold processed, useful data sets.
To keep things consistent and searchable, several conventions must be followed for the key/value pairs that are entered into the database.
- Keys must be entered in all lower case letters.
- Chromosome names must be entered in all lower case letters, and prefixed with "chr".
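As a minimal sketch of how these two conventions could be applied before a document is inserted, the helpers below lower-case every key and normalize the chromosome name. The function names are hypothetical and are not part of this project. The sketch follows the convention literally and lower-cases the whole chromosome name; if roman numerals need to keep their capitalization (as in the chrVII example in the schema), the lower-casing step would have to be adjusted.

```python
def normalize_key(key: str) -> str:
    """Return the key in lower case with surrounding whitespace removed."""
    return key.strip().lower()


def normalize_chromosome(name: str) -> str:
    """Lower-case a chromosome name and make sure it starts with "chr"."""
    name = name.strip().lower()
    if not name.startswith("chr"):
        name = "chr" + name
    return name


def normalize_document(doc: dict) -> dict:
    """Apply both conventions to a flat key/value document.

    Assumption: the chromosome name is stored under the key "chr".
    """
    clean = {normalize_key(key): value for key, value in doc.items()}
    if isinstance(clean.get("chr"), str):
        clean["chr"] = normalize_chromosome(clean["chr"])
    return clean
```

For example, `normalize_document({"Chr": "VII", "Pos": 1200})` yields a document whose keys are `chr` and `pos`, with the chromosome stored as `chrvii`.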
The schema for this database will be relatively fluid, as it will be built in a MongoDB environment, with the expectation that sharding will be applied and the database scaled to run over a cluster of machines. However, neither sharding nor a cluster will be a requirement; both are simply implementation details.
For data not yet released to the public, we may consider "hidden" flags to prevent the data from being visible. This will require a web interface/API that respects the hidden flag, and permissions appropriate to the task.
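One way an interface could respect such a flag is to add a condition to every query issued on behalf of a user who may not see hidden data. The sketch below only builds the MongoDB query document; it assumes the flag is stored in a boolean field named `hide`, and the function name and the `can_view_hidden` permission check are hypothetical.

```python
from typing import Optional


def sample_filter(can_view_hidden: bool, extra: Optional[dict] = None) -> dict:
    """Build a MongoDB query document that hides flagged samples.

    `extra` holds any other query conditions. When the caller may not view
    hidden data, a condition is added that matches documents whose `hide`
    field is false or missing ($ne also matches absent fields).
    """
    query = dict(extra or {})
    if not can_view_hidden:
        query["hide"] = {"$ne": True}
    return query
```

The resulting dictionary could be passed to a `find()` call on the relevant collection. Using `$ne` rather than an equality test on `False` means that documents entered without the flag are still treated as visible, which is a design choice rather than a requirement.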
The current schema for the yeast epigenetic ChIP-chip data is as follows:
- _id - automatically generated when a wave is added to the database; this attribute is an ObjectId.
- pos - the chromosomal position of the wave
- height - the height of the wave
- chr - the chromosome name using roman numerals (ex. chrVII)
- stddev - the standard deviation of the wave's Gaussian distribution
- sample_id - a reference to the _id of an entry in the samples Collection
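To illustrate how the wave attributes above might be queried, here is a minimal sketch that builds a MongoDB query document for all waves within a chromosomal region, optionally restricted to one sample. The function name is hypothetical, and `sample_ref` stands in for whatever value is stored in the wave's `sample_id` field (an ObjectId, per the description above).

```python
def waves_in_region(chrom, start, end, sample_ref=None):
    """Build a query document for waves whose `pos` lies in [start, end].

    `chrom` is matched against the `chr` field as given. `sample_ref` is
    optional; when supplied it is matched against the wave's `sample_id`.
    """
    if start > end:
        raise ValueError("start must not be greater than end")
    query = {"chr": chrom, "pos": {"$gte": start, "$lte": end}}
    if sample_ref is not None:
        query["sample_id"] = sample_ref
    return query
```

The returned dictionary could be passed to `find()` on the waves Collection.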
Each entry in the samples Collection has the following attributes:
- _id - automatically generated when a sample is added to the database; this attribute is an ObjectId.
- antibody - this is the antibody used in the ChIP-chip experiment
- antibody_volume - the amount of antibody used
- array_lot_number - the lot number of the array used
- array_type - the type of array (ex. GeneChip S. cerevisiae Tiling 1.0R Array)
- catalog_number - catalog number of the array
- comments - any comments regarding the sample
- crosslinking_time - length of time used for crosslinking
- exp_date - date of the experiment
- file_name - name of the file that was used to import the waves
- haswaves - true if the sample has associated waves in the waves Collection
- hide - true if the sample should be hidden from the web browser interface
- input_file - path and name of the .wig file used to generate waves
- make_wig - true if WaveGenerator created the .wig file, false otherwise
- min_height - value of min_height used in the WaveGenerator
- mutations - mutation state of the sample
- number_waves - WaveGenerator setting on whether to display a number next to each wave
- output_path - directory where WaveGenerator saved the .waves file
- processor_threads - number of processors used by the WaveGenerator
- protocol - which protocol was used in the experiment (ex. T7)
- pubmed_id - the PubMed ID associated with the experiment
- researcher - name of the researcher who carried out the experiment
- sample_id - composite name composed of type, antibody, mutation, and date. Note: sample_id must be unique, and must not contain commas (,) or long dashes (–). Short dashes (-) are OK.
- strain_background - the background yeast strain
- strain_number - the number of the yeast strain
- type - type of experiment (IP, input, mock)
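The constraints on sample_id can be checked before a sample is inserted. The sketch below is an illustration only: the function names are hypothetical, the underscore separator is an assumption (the separator actually used is not specified here), and the em dash is rejected alongside the en dash as an extra precaution. Uniqueness is not checked in this code; it would have to be enforced against the database.

```python
# Characters not allowed in sample_id: comma, en dash, and (as an added
# assumption) em dash. The ordinary hyphen "-" is allowed.
FORBIDDEN_CHARS = (",", "\u2013", "\u2014")


def validate_sample_id(sample_id: str) -> None:
    """Raise ValueError if sample_id contains a forbidden character."""
    for char in FORBIDDEN_CHARS:
        if char in sample_id:
            raise ValueError(f"sample_id must not contain {char!r}")


def build_sample_id(exp_type, antibody, mutation, exp_date, sep="_"):
    """Join type, antibody, mutation, and date into a composite sample_id.

    The separator is a hypothetical choice made for this sketch.
    """
    sample_id = sep.join([exp_type, antibody, mutation, exp_date])
    validate_sample_id(sample_id)
    return sample_id
```

With made-up values, `build_sample_id("IP", "H3K4me3", "wt", "2011-05-02")` produces `IP_H3K4me3_wt_2011-05-02`, while any component containing a comma or a long dash raises a `ValueError`.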
- Experiment type (e.g. ChIP-seq, methylation array, etc.)
- Experiment conditions
- Patient data
- Sample data
- bins - for map/reduce?
- Probe set information - for array data
- build indices on the database
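As a rough sketch of what building indices could look like, the plan below indexes waves by chromosome and position and by sample reference, and puts a unique index on the samples' sample_id (which must be unique). The choice of fields is an assumption, not a decision recorded here, and `build_indices` is a hypothetical helper. It expects an object that behaves like a pymongo `Database`, whose collections provide `create_index`.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING

# Collection name -> list of (index keys, index options). Assumed plan.
INDEX_PLAN = {
    "waves": [
        ([("chr", ASCENDING), ("pos", ASCENDING)], {}),
        ([("sample_id", ASCENDING)], {}),
    ],
    "samples": [
        ([("sample_id", ASCENDING)], {"unique": True}),
    ],
}


def build_indices(db):
    """Create every index in INDEX_PLAN and return the create_index results.

    `db` must support db[collection_name].create_index(keys, **options),
    as a pymongo Database does.
    """
    created = []
    for collection_name, specs in INDEX_PLAN.items():
        for keys, options in specs:
            created.append(db[collection_name].create_index(keys, **options))
    return created
```

With pymongo installed and a server running, this could be called with a database handle obtained from `MongoClient`; the database name would depend on the deployment.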