Added fastq_filtrator.py by ibragimovaamina · Pull Request #2 · ibragimovaamina/Python_BI_2022

ibragimovaamina · 2022-10-13T19:54:24Z

Added fastq_filtrator.py (homework 2)

krglkvrmn

Excellent work, a few minor remarks below :)

krglkvrmn · 2022-10-23T19:42:29Z

@@ -0,0 +1,96 @@
+# This function returns GC-content of sequence in percents
+def gc_content(read_seq):
+    return sum(map(read_seq.upper().count, ['G','C'])) / len(read_seq.strip()) * 100


This is neat, of course, but for a task as simple as computing GC content it is overkill.

Also, it is not optimal for performance: `count` makes a full pass over the string internally, and we call it twice, even though GC can be counted in a single pass. But that is a minor point :)
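A minimal sketch of the single-pass idea from the comment above, not the author's code. Treating an empty read as 0% GC is an assumption added here to avoid division by zero. Note that in CPython two C-level `str.count` calls may still run faster than a Python-level loop, so it is worth measuring before switching.

```python
def gc_content(read_seq):
    """Return the GC content of a sequence as a percentage."""
    seq = read_seq.strip().upper()
    if not seq:
        # Assumption: an empty read is reported as 0% GC
        return 0.0
    # One pass over the sequence, counting both G and C
    gc_count = sum(1 for base in seq if base in "GC")
    return gc_count / len(seq) * 100
```

Stripping and upper-casing once up front also keeps the numerator and denominator computed from the same string.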

krglkvrmn · 2022-10-23T19:42:56Z

+
+# This function checks whether GC-content of sequence is in bounds 
+def gc_filter(read_seq, gc_bounds):
+    if isinstance(gc_bounds, (tuple, list)):


krglkvrmn · 2022-10-23T19:47:08Z

+    if gc_content(read_seq) >= min_gc and gc_content(read_seq) <= max_gc:    # GC-content is in bounds
+        return 'Pass'
+    else:
+        return 'Fail'


In cases like this it is better not to return strings; `True` and `False` are ideal for this. It also lets the code be simplified considerably.

Suggested change

-    if gc_content(read_seq) >= min_gc and gc_content(read_seq) <= max_gc:    # GC-content is in bounds
-        return 'Pass'
-    else:
-        return 'Fail'
+    return min_gc <= gc_content(read_seq) <= max_gc    # GC-content is in bounds
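For illustration, here is how the whole filter might look once it returns a boolean. The unpacking of `gc_bounds` is not shown in the diff, so the branch that treats a single number as an upper bound is an assumption, as is the compact `gc_content` helper included to keep the sketch self-contained.

```python
def gc_content(read_seq):
    """Return the GC content of a sequence as a percentage."""
    seq = read_seq.strip().upper()
    return (seq.count("G") + seq.count("C")) / len(seq) * 100


def gc_filter(read_seq, gc_bounds):
    """Return True if the GC content of read_seq lies within gc_bounds."""
    if isinstance(gc_bounds, (tuple, list)):
        min_gc, max_gc = gc_bounds
    else:
        # Assumption: a single number is interpreted as the upper bound
        min_gc, max_gc = 0, gc_bounds
    # The chained comparison yields a bool and calls gc_content only once
    return min_gc <= gc_content(read_seq) <= max_gc
```

The caller can then write `if gc_filter(read_seq, gc_bounds) and ...:` without comparing against `'Pass'`.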

krglkvrmn · 2022-10-23T19:48:07Z

@@ -0,0 +1,96 @@
+# This function returns GC-content of sequence in percents
+def gc_content(read_seq):
+    return sum(map(read_seq.upper().count, ['G','C'])) / len(read_seq.strip()) * 100


You can call `strip` once, right when the read is read from the file, so it does not have to be repeated in every function.

krglkvrmn · 2022-10-23T19:49:56Z

+            read_header = file.readline()     ## I'm not stripping '\n' here consciously
+            read_seq = file.readline()        ## for convenient appending to output files.
+            read_plus_str = file.readline()   ## I'm going to take '\n' into account 
+            read_qual = file.readline()       ## when I'll be counting length of these strings.


This is exactly the place where `strip` can be added.

Suggested change

-            read_header = file.readline()     ## I'm not stripping '\n' here consciously
-            read_seq = file.readline()        ## for convenient appending to output files.
-            read_plus_str = file.readline()   ## I'm going to take '\n' into account 
-            read_qual = file.readline()       ## when I'll be counting length of these strings.
+            read_header = file.readline().strip()     ## I'm not stripping '\n' here consciously
+            read_seq = file.readline().strip()        ## for convenient appending to output files.
+            read_plus_str = file.readline().strip()   ## I'm going to take '\n' into account 
+            read_qual = file.readline().strip()       ## when I'll be counting length of these strings.
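If the lines are stripped on read, the newline has to be added back when a record is written out. A rough sketch of that trade-off follows; `read_fastq_records` and `write_record` are hypothetical helper names, not functions from the pull request, and stopping at the first blank header line is an assumption about how a dangling newline at the end of the file should be handled.

```python
import io


def read_fastq_records(handle):
    """Yield (header, seq, plus, qual) tuples with trailing newlines removed."""
    while True:
        read_header = handle.readline().strip()
        if not read_header:
            # End of file, or a dangling blank line at the end of the input
            break
        read_seq = handle.readline().strip()
        read_plus_str = handle.readline().strip()
        read_qual = handle.readline().strip()
        yield read_header, read_seq, read_plus_str, read_qual


def write_record(handle, record):
    """Write one four-line record, restoring the newline after each line."""
    handle.write("\n".join(record) + "\n")


# Usage with in-memory files standing in for real FASTQ files
source = io.StringIO("@read1\nATGC\n+\nIIII\n\n")
output = io.StringIO()
for record in read_fastq_records(source):
    write_record(output, record)
```

With this arrangement the filter functions receive clean strings, so `len(read_seq)` no longer has to account for the trailing `'\n'`.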

krglkvrmn · 2022-10-23T19:53:31Z

+            if (gc_filter(read_seq, gc_bounds) == 'Pass' and
+                len_filter(read_seq, length_bounds) == 'Pass' and
+                qual_filter(read_qual, quality_threshold) == 'Pass'):
+            # Appending read to output files 
+                add_read_to_passed(read_header, read_seq,
+                                   read_plus_str, read_qual,
+                                   output_file_prefix)
+            else: 
+                if save_filtered == True:
+                    add_read_to_failed(read_header, read_seq,
+                                       read_plus_str, read_qual,
+                                       output_file_prefix)


Very convenient, you separated the logic nicely.

ibragimovaamina and others added 6 commits October 13, 2022 22:49

Added fastq_filtrator.py and empty README

97a198b

Minor improvement

377d71e

Handling case when gc_bounds or length_bounds parameter is a list, no…

af6dbec

…t a tuple

Minor improvements

98cd8ea

Handling case when there is a dangling newline character at the end o…

b5dc285

…f input file

Added README

169808a

krglkvrmn suggested changes Oct 23, 2022

ibragimovaamina wants to merge 6 commits into main from homework_2

2 participants