Database

Purpose

This project includes a database implementation for the storage and analysis of epigenetics data. The goal is to produce an open-source database implementation and to provide an open repository for collecting public epigenetic data in a format that is relevant and useful. The database is not intended to be a repository for raw data; it will hold processed, useful data sets.

Introduction

Conventions

To keep things consistent and searchable, several conventions must be followed for the key/value pairs that are entered into the database.

Keys must be entered in all lower case letters.
Chromosome names must be entered in all lower case letters, and prefixed with "chr".
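
The conventions above could be enforced before a record is written. The following is only a minimal sketch: the function names `normalize_keys` and `normalize_chromosome` are hypothetical and not part of this project. It lower-cases keys and guarantees a lower-case "chr" prefix. Whether the identifier after the prefix should also be lower-cased is left open here, since the schema's own example (chrVII) keeps upper-case roman numerals.

```python
def normalize_keys(record):
    """Return a copy of record with every key lower-cased."""
    normalized = {}
    for key, value in record.items():
        lowered = key.lower()
        if lowered in normalized:
            # Two keys that differ only by case would silently overwrite each other.
            raise ValueError(f"duplicate key after lower-casing: {lowered!r}")
        normalized[lowered] = value
    return normalized


def normalize_chromosome(name):
    """Ensure a chromosome name starts with a lower-case 'chr' prefix.

    Assumption: the identifier after the prefix is kept exactly as entered.
    """
    name = name.strip()
    suffix = name[3:] if name[:3].lower() == "chr" else name
    if not suffix:
        raise ValueError("chromosome name has no identifier after 'chr'")
    return "chr" + suffix
```

For example, `normalize_chromosome("VII")` and `normalize_chromosome("CHRVII")` both return `"chrVII"`.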

Schema

The schema for this database will be relatively fluid, as it will be built in a MongoDB environment, with the expectation that sharding will be applied and the database scaled to run over a cluster of machines. However, neither sharding nor a cluster will be a requirement; each is simply an implementation detail.

For data not yet released to the public, we may consider "hidden" flags to prevent the data from being visible. This will require a web interface/API that respects the hidden flag, and permissions appropriate to the task.
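
One way an interface might respect such a flag is to add a condition to every query it builds. The sketch below is an illustration under assumptions: `sample_filter` and its `include_hidden` parameter are hypothetical names, and the flag is assumed to be the `hide` field of the samples Collection. It returns a MongoDB-style filter document and does not talk to a database.

```python
def sample_filter(base_filter=None, include_hidden=False):
    """Build a filter document for the samples Collection.

    include_hidden stands in for whatever permission check the
    interface performs; it is an assumption, not an existing setting.
    """
    query = dict(base_filter or {})
    if not include_hidden:
        # $ne also matches documents that have no 'hide' field at all,
        # so samples without the flag stay visible.
        query["hide"] = {"$ne": True}
    return query
```

Calling `sample_filter({"type": "IP"})` yields `{"type": "IP", "hide": {"$ne": True}}`, while passing `include_hidden=True` returns the base filter unchanged.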

The current schema for the yeast epigenetic ChIP-chip data is as follows:

waves Collection:

_id - automatically generated when a wave is added to the database, this attribute is an ObjectId.
pos - the chromosomal position of the wave
height - the height of the wave
chr - the chromosome name using roman numerals (ex. chrVII)
stddev - the standard deviation of the wave's gaussian distribution
sample_id - a reference to the _id of an entry in the samples Collection
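
As an illustration of the fields listed above, a wave document could be assembled as a plain dictionary before insertion. This is a sketch only: `make_wave` is a hypothetical helper, the checks it performs are assumptions, and all values in the usage line are made up. `_id` is deliberately omitted so that the database can generate it.

```python
def make_wave(pos, height, chrom, stddev, sample_ref):
    """Assemble a wave document; _id is left out so the database generates it."""
    if not chrom.startswith("chr"):
        raise ValueError("chr must be prefixed with 'chr'")
    if stddev < 0:
        # A standard deviation cannot be negative.
        raise ValueError("stddev cannot be negative")
    return {
        "pos": pos,
        "height": height,
        "chr": chrom,
        "stddev": stddev,
        # In the database this would be the _id of a samples document.
        "sample_id": sample_ref,
    }


# Placeholder values for illustration only.
wave = make_wave(104250, 3.2, "chrVII", 85.0, "placeholder-sample-ref")
```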

samples Collection

_id - automatically generated when a sample is added to the database, this attribute is an ObjectId.
antibody - this is the antibody used in the ChIP-chip experiment
antibody_volume - the amount of antibody used
array_lot_number - the lot number of the array used
array_type - the type of array (ex. GeneChip S. cerevisiae Tiling 1.0R Array)
catalog_number - catalog number of the array
comments - any comments regarding the sample
crosslinking_time - length of time used for crosslinking
exp_date - date of the experiment
file_name - name of the file that was used to import the waves
haswaves - true if the sample has associated waves in the waves Collection
hide - true if the sample should be hidden from the web browser interface
input_file - path and name of the .wig file used to generate waves
make_wig - true if WaveGenerator created the .wig file, false otherwise
min_height - value of min_height used in the WaveGenerator
mutations - mutation state of the sample
number_waves - WaveGenerator setting on whether to display number next to wave
output_path - directory where WaveGenerator saved the .waves file
processor_threads - number of processors used by the WaveGenerator
protocol - which protocol was used in the experiment (ex. T7)
pubmed_id - the PubMed ID associated with the experiment
researcher - name of the researcher who carried out the experiment
sample_id - composite name composed of type, antibody, mutation, and date. Note: sample_id must be unique, and must not contain commas (,) or long dashes (–). Short dashes (-) are OK.
strain_background - the background yeast strain
strain_number - the number of the yeast strain
type - type of experiment (IP, input, mock)
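
The sample_id rules in the list above (composite of type, antibody, mutation, and date; unique; no commas or long dashes) can be sketched as follows. The function names, the underscore separator, and the rejection of U+2014 in addition to the en dash shown above are all assumptions made for illustration; the schema does not specify them.

```python
# Characters a sample_id must not contain. U+2013 is the long dash shown
# in the schema; U+2014 is added here as an assumption.
FORBIDDEN_CHARS = (",", "\u2013", "\u2014")


def validate_sample_id(sample_id, existing_ids=()):
    """Raise ValueError if sample_id breaks the stated rules."""
    for ch in FORBIDDEN_CHARS:
        if ch in sample_id:
            raise ValueError(f"sample_id contains forbidden character {ch!r}")
    if sample_id in existing_ids:
        raise ValueError(f"sample_id is not unique: {sample_id!r}")


def build_sample_id(exp_type, antibody, mutation, date, sep="_", existing_ids=()):
    """Join the four components in the documented order and validate the result."""
    parts = (exp_type, antibody, mutation, date)
    sample_id = sep.join(str(part).strip() for part in parts)
    validate_sample_id(sample_id, existing_ids)
    return sample_id
```

With made-up values, `build_sample_id("IP", "H3K4me3", "wt", "2012-03-01")` returns `"IP_H3K4me3_wt_2012-03-01"`; short dashes in the date pass validation.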

Required Collections:

Experiment type (e.g. ChIP-seq, methylation array, etc.)
Experiment conditions
Patient data
Sample data
bins - for map/reduce?
Probe set information - for array data

TODO:

build indices on the database
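
A possible starting point for this item is sketched below. Which fields to index has not been decided in this document, so the choice here is an assumption: a unique index on `sample_id` in the samples Collection, and a compound index on `sample_id`, `chr`, and `pos` in the waves Collection. `build_indexes` is a hypothetical helper; it expects an object that behaves like a PyMongo database handle, whose collections provide `create_index`.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING


def build_indexes(db):
    """Create the assumed indexes and return the names reported by the driver."""
    names = []
    # Assumption: sample_id is the lookup key and must be unique.
    names.append(
        db["samples"].create_index([("sample_id", ASCENDING)], unique=True)
    )
    # Assumption: waves are usually fetched per sample, then by position.
    names.append(
        db["waves"].create_index(
            [("sample_id", ASCENDING), ("chr", ASCENDING), ("pos", ASCENDING)]
        )
    )
    return names
```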
