Added fastq_filtrator.py#2
krglkvrmn
left a comment
Great work, just a few small remarks below :)
| @@ -0,0 +1,96 @@ | |||
| # This function returns GC-content of sequence in percents | |||
| def gc_content(read_seq): | |||
| return sum(map(read_seq.upper().count, ['G','C'])) / len(read_seq.strip()) * 100 | |||
Neat, of course, but for a task as simple as computing GC content this is overkill.
It is also not ideal for performance: count does a full pass over the string internally, and we call it twice, although GC can be counted in a single pass. But these are minor points :)
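A minimal sketch of the single-pass variant mentioned above. The explicit strip and the empty-sequence guard are assumptions added for illustration, not part of the original diff. Note that this is "one pass" only in the algorithmic sense: in CPython the built-in count runs in C, so two count calls may well be faster in practice than a Python-level loop.

```python
def gc_content(read_seq):
    """Return the GC content of a sequence as a percentage."""
    # Strip once so a trailing newline does not inflate the length.
    seq = read_seq.strip().upper()
    if not seq:
        # Assumption: an empty read is reported as 0% rather than raising.
        return 0.0
    # Single traversal: count every base that is G or C.
    gc_count = sum(1 for base in seq if base in "GC")
    return gc_count / len(seq) * 100
```

For example, `gc_content("GCAT\n")` counts two of four bases as G or C.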
|
|
||
| # This function checks whether GC-content of sequence is in bounds | ||
| def gc_filter(read_seq, gc_bounds): | ||
| if isinstance(gc_bounds, (tuple, list)): |
| if gc_content(read_seq) >= min_gc and gc_content(read_seq) <= max_gc: # GC-content is in bounds | ||
| return 'Pass' | ||
| else: | ||
| return 'Fail' |
In cases like this you should not return strings; True and False are ideal for that. Then this can also be simplified considerably.
| if gc_content(read_seq) >= min_gc and gc_content(read_seq) <= max_gc: # GC-content is in bounds | |
| return 'Pass' | |
| else: | |
| return 'Fail' | |
| return min_gc <= gc_content(read_seq) <= max_gc  # GC-content is in bounds |
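A rough sketch of how the whole filter might look once it returns a boolean. The diff only shows the tuple/list branch, so treating a single number as an upper bound (with 0 as the lower bound) is an assumption here, and the simple gc_content helper is included only to keep the example self-contained.

```python
def gc_content(read_seq):
    """Return GC content as a percentage (simple helper for this sketch)."""
    seq = read_seq.strip().upper()
    return (seq.count("G") + seq.count("C")) / len(seq) * 100 if seq else 0.0


def gc_filter(read_seq, gc_bounds):
    """Return True if the GC content of the read lies within gc_bounds."""
    if isinstance(gc_bounds, (tuple, list)):
        min_gc, max_gc = gc_bounds
    else:
        # Assumption: a single number is interpreted as the upper bound.
        min_gc, max_gc = 0, gc_bounds
    # Chained comparison: gc_content is evaluated only once.
    return min_gc <= gc_content(read_seq) <= max_gc
```

With boolean results, the caller no longer needs to compare against 'Pass' and can write `if gc_filter(read_seq, gc_bounds) and ...:` directly.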
| @@ -0,0 +1,96 @@ | |||
| # This function returns GC-content of sequence in percents | |||
| def gc_content(read_seq): | |||
| return sum(map(read_seq.upper().count, ['G','C'])) / len(read_seq.strip()) * 100 | |||
You can call strip right when the read is read in, so that you do not have to write it in every function.
| read_header = file.readline() ## I'm not stripping '\n' here consciously | ||
| read_seq = file.readline() ## for convenient appending to output files. | ||
| read_plus_str = file.readline() ## I'm going to take '\n' into account | ||
| read_qual = file.readline() ## when I'll be counting length of these strings. |
This is exactly the place where strip can be inserted.
| read_header = file.readline() ## I'm not stripping '\n' here consciously | |
| read_seq = file.readline() ## for convenient appending to output files. | |
| read_plus_str = file.readline() ## I'm going to take '\n' into account | |
| read_qual = file.readline() ## when I'll be counting length of these strings. | |
| read_header = file.readline().strip() ## I'm not stripping '\n' here consciously | |
| read_seq = file.readline().strip() ## for convenient appending to output files. | |
| read_plus_str = file.readline().strip() ## I'm going to take '\n' into account | |
| read_qual = file.readline().strip() ## when I'll be counting length of these strings. |
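One consequence of stripping at read time is that the newline has to be added back when a read is written to an output file. A minimal sketch of that pairing follows; read_record and write_record are hypothetical helper names, not functions from the PR, and a blank header line is treated as end of file by assumption.

```python
import io


def read_record(file):
    """Read one 4-line FASTQ record with trailing newlines stripped.

    Returns None when the header line is empty (assumed to mean end of file).
    """
    header, seq, plus, qual = (file.readline().strip() for _ in range(4))
    if not header:
        return None
    return header, seq, plus, qual


def write_record(file, record):
    """Write a record back out, restoring the newline after each line."""
    file.write("\n".join(record) + "\n")


# Usage with in-memory files standing in for real FASTQ files.
source = io.StringIO("@read1\nACGT\n+\nIIII\n")
output = io.StringIO()
record = read_record(source)
write_record(output, record)
```

With this split, the filter functions can use len(read_seq) directly without accounting for '\n'.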
| if (gc_filter(read_seq, gc_bounds) == 'Pass' and | ||
| len_filter(read_seq, length_bounds) == 'Pass' and | ||
| qual_filter(read_qual, quality_threshold) == 'Pass'): | ||
| # Appending read to output files | ||
| add_read_to_passed(read_header, read_seq, | ||
| read_plus_str, read_qual, | ||
| output_file_prefix) | ||
| else: | ||
| if save_filtered == True: | ||
| add_read_to_failed(read_header, read_seq, | ||
| read_plus_str, read_qual, | ||
| output_file_prefix) |
Added fastq_filtrator.py (homework 2)