Incorrect splitting into sentences #39

@antonvasilev52

Description

Hi all! I am talking about this regular expression, which is later used to count sentences:

SENTENCE_REGEX = /[^\.!?\s][^\.!?]*(?:[\.!?](?!['"]?\s|$)[^\.!?]*)*[\.!?]?['"]?(?=\s|$)/

For texts like "Mr. Smith is a doctor" this gives two sentences, ["Mr.", "Smith is a doctor"], which results in incorrect readability scores.
Maybe there is a way to improve it so that some common titles (such as "Mr" or "Dr") are not treated as sentence endings?
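
A minimal reproduction of the behaviour described above, as a sketch. It assumes the regex is applied with String#scan; the helper name scan_sentences is hypothetical and not part of the library.

```ruby
# The regex quoted in this issue, unchanged.
SENTENCE_REGEX = /[^\.!?\s][^\.!?]*(?:[\.!?](?!['"]?\s|$)[^\.!?]*)*[\.!?]?['"]?(?=\s|$)/

# Hypothetical helper: returns every substring the regex treats as a sentence.
def scan_sentences(text)
  text.scan(SENTENCE_REGEX)
end

p scan_sentences("Mr. Smith is a doctor")
# => ["Mr.", "Smith is a doctor"]
```

The period after "Mr" is followed by whitespace, so the regex closes the sentence there and the count is off by one.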

I am not very good at using the scan method, but if we use split instead we can probably use an expression similar to this:

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!|(\."))\s

which is also far from perfect, because it avoids splitting after "Mr." and "Dr." but not after "Mrs." (still better than nothing ☺).
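
One possible way to cover "Mrs." as well is to build one fixed-width negative lookbehind per title from an explicit list. This is only a sketch of the split-based idea, not a tested patch: the TITLES list, the constant SPLIT_REGEX and the helper split_sentences_guarded are all hypothetical names, and the list is deliberately short.

```ruby
# Hypothetical list of titles that should not end a sentence.
TITLES = %w[Mr Mrs Ms Dr Prof].freeze

# One negative lookbehind per title, e.g. (?<!Mr\.)(?<!Mrs\.)...
# Each lookbehind has a fixed width, which keeps the pattern simple.
TITLE_GUARDS = TITLES.map { |t| "(?<!#{Regexp.escape(t)}\\.)" }.join

# Split on whitespace that follows . ? or ! (optionally followed by a
# closing quote), unless the text just before it is one of the titles.
SPLIT_REGEX = /#{TITLE_GUARDS}(?:(?<=[.?!])|(?<=[.?!]["']))\s+/

# Hypothetical helper: splits text into sentences using SPLIT_REGEX.
def split_sentences_guarded(text)
  text.strip.split(SPLIT_REGEX)
end

p split_sentences_guarded("Mr. Smith is a doctor. Mrs. Jones is a lawyer!")
# => ["Mr. Smith is a doctor.", "Mrs. Jones is a lawyer!"]
```

This still misses anything not in the list (for example "e.g." or initials), and it will wrongly refuse to split after a word that merely ends in one of the titles, so it trades one kind of error for a smaller one.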
