Dominant Script Strategy

The Dominant Script Strategy is a normalization approach that identifies the most frequent Unicode script in a text and applies the normalization rules for that script. It works best on text written predominantly in a single script, such as Latin, Cyrillic, Arabic, or Greek.

Overview

This strategy addresses script-based homoglyph attacks by relying on the observation that most legitimate text is written predominantly in one Unicode script. Once the dominant script is known, script-specific rules map look-alike characters from other scripts back to their canonical forms in that script, which preserves the text’s intended linguistic context.

The strategy is especially valuable for applications dealing with user-generated content, document processing, and security systems where homoglyph-based attacks might attempt to bypass filters or detection mechanisms.

Implementation Details

Script Detection and Analysis: The strategy begins by analyzing the Unicode script distribution across all characters in the input text. This analysis uses the unicodedataplus library to determine each character’s script:

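import logging
from collections import Counter

import unicodedataplus


def detect_dominant_script(text: str) -> str:
    # Count how many characters fall into each Unicode script.
    script_counts = Counter(unicodedataplus.script(char) for char in text)
    total_count = sum(script_counts.values())
    # Note: max() raises ValueError on empty text, so callers must check for it first.
    dominant_script = max(script_counts.keys(), key=lambda k: script_counts[k])

    # Warn when no single script clearly dominates the text.
    if script_counts[dominant_script] / total_count < 0.75:
        logging.warning(
            f"The dominant script '{dominant_script}' comprises less than 75% of the total character count. "
            "This is unusual, as most texts predominantly consist of characters from a single script."
        )
    return dominant_script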

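The algorithm counts how often each Unicode script occurs and picks the most common one. Every character is counted, including whitespace, digits, and punctuation, which Unicode assigns to the Common script, so these characters lower the dominant script's share. If the dominant script makes up less than 75% of all characters, a warning is logged, since this may indicate either mixed-script text or an attack attempt. The warning is informational only: the detected script is still returned.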
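To make the threshold arithmetic concrete, the following sketch tallies script shares for a short mixed string. It deliberately avoids the third-party dependency: `rough_script` is a hypothetical stand-in that guesses a script from the first word of the character's Unicode name. That is only a crude approximation of `unicodedataplus.script`, used here so the example runs with the standard library alone.

```python
import unicodedata
from collections import Counter

# Hypothetical stand-in for unicodedataplus.script(): guesses the script from
# the first word of the Unicode character name. Crude, for illustration only.
KNOWN_SCRIPTS = {"LATIN": "Latin", "CYRILLIC": "Cyrillic", "GREEK": "Greek"}


def rough_script(char: str) -> str:
    first_word = unicodedata.name(char, "UNKNOWN").split()[0]
    return KNOWN_SCRIPTS.get(first_word, "Common")


def script_shares(text: str) -> dict:
    """Return each script's share of all characters in text."""
    counts = Counter(rough_script(char) for char in text)
    total = sum(counts.values())
    return {script: count / total for script, count in counts.items()}


# Latin sentence with five Cyrillic look-alikes (written as escapes for clarity).
sample = "Exam\u0440le t\u0435xt with s\u043eme homogl\u0443\u0440hs."
shares = script_shares(sample)
for script, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{script}: {share:.1%}")
```

In this sample, Latin accounts for 24 of the 34 characters, with 5 Cyrillic letters and 5 Common characters (four spaces and a period), so the Latin share falls below 0.75 even though only five letters were substituted.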

Normalization Map Generation: Once the dominant script is identified, the strategy retrieves a script-specific normalization map that defines appropriate replacements for homoglyphs:
```
dominant_script = detect_dominant_script(text)
normalization_map = replacer.get_normalization_map_for_script_block_and_category(
    script=dominant_script, **kwargs
)
```
The normalization map is generated by the HomoglyphReplacer instance, which contains comprehensive mappings of homoglyphs to their canonical forms within specific scripts. This ensures that replacements are contextually appropriate for the identified script.

Character Translation and Replacement: The final step applies the normalization map to the entire text using Python’s efficient str.translate() method:

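import logging


def apply_dominant_script_strategy(replacer, text: str, **kwargs) -> str:
    # Nothing to normalize: log it and return an empty string.
    if not text:
        logging.warning("Empty text provided for normalization")
        return ""

    # A replacer is required to look up the normalization map.
    if not replacer:
        raise ValueError("No replacer provided for normalization")

    # Step 1: find the most frequent script. Step 2: fetch its normalization map.
    dominant_script = detect_dominant_script(text)
    normalization_map = replacer.get_normalization_map_for_script_block_and_category(
        script=dominant_script, **kwargs
    )

    # Without a map there is nothing to apply, so return the input unchanged.
    if not normalization_map:
        logging.warning(f"No normalization map available for script '{dominant_script}'")
        return text

    # Step 3: replace every mapped character in a single translate() call.
    return text.translate(str.maketrans(normalization_map))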

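The function handles three edge cases explicitly: empty text logs a warning and returns an empty string, a missing replacer raises a ValueError, and a missing normalization map logs a warning and returns the text unchanged.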
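The translation step can be tried in isolation. The sketch below uses a small hand-written map as a stand-in for the one returned by the replacer. `TOY_MAP` is hypothetical and covers only four Cyrillic look-alikes; the real map comes from the HomoglyphReplacer instance.

```python
# Hypothetical, hand-written map: Cyrillic look-alike -> Latin letter.
# The real map is supplied by the replacer; this one exists only for illustration.
TOY_MAP = {
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
}

# str.maketrans accepts a dict whose keys are single characters and
# converts it to the code-point table that str.translate expects.
table = str.maketrans(TOY_MAP)

text = "Exam\u0440le t\u0435xt with s\u043eme homogl\u0443\u0440hs."
normalized = text.translate(table)

print(normalized)
print(normalized.isascii())
```

Characters without an entry in the table pass through untouched, which is why the map's coverage determines what the strategy can and cannot repair.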

Example Usage

The following example normalizes a predominantly Latin sentence that contains a few Cyrillic homoglyphs:

from silverspeak.homoglyphs.normalization import apply_dominant_script_strategy
from silverspeak.homoglyphs import HomoglyphReplacer

# Text containing Cyrillic homoglyphs mixed with Latin
text = "Examрle tеxt with sоme homoglурhs."  # Contains Cyrillic 'р', 'е', 'о', 'у'

# Initialize the replacer
replacer = HomoglyphReplacer()

# Apply dominant script normalization
normalized_text = apply_dominant_script_strategy(replacer, text)
print(normalized_text)  # Output: "Example text with some homoglyphs."

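This example shows how the strategy analyzes the script distribution and normalizes characters to match the dominant script (Latin in this case). Note that under the detection code shown earlier, Latin letters make up only 24 of the 34 characters (about 71%), because the four spaces and the period count as Common. The sub-75% warning is therefore logged for this input, and normalization still proceeds.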

Alternative Usage via normalize_text:

from silverspeak.homoglyphs import normalize_text
from silverspeak.homoglyphs.utils import NormalizationStrategies

text = "Examрle tеxt with sоme homoglурhs."
normalized_text = normalize_text(
    text,
    strategy=NormalizationStrategies.DOMINANT_SCRIPT
)
print(normalized_text)

Advanced Usage with Custom Parameters:

# Preserve case during normalization
normalized_text = apply_dominant_script_strategy(
    replacer,
    text,
    preserve_case=True
)

# Filter by specific Unicode category
normalized_text = apply_dominant_script_strategy(
    replacer,
    text,
    category="Ll"  # Lowercase letters only
)

Key Considerations

Strengths and Effectiveness:

Computational Efficiency: The strategy makes one linear pass over the text for script analysis and a second for character replacement.
Script Consistency: Ensures that the entire text maintains consistency with the dominant script, preventing mixed-script confusion.
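Robust Detection: The 75% threshold flags text in which no single script clearly dominates, while tolerating a moderate share of characters from other scripts. It only triggers a warning and does not change the normalization result.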

Limitations and Use Cases:

Single Script Assumption: This strategy assumes that legitimate text is predominantly written in a single script. It may not be suitable for genuinely multilingual documents.
Mixed Script Handling: For texts with intentionally mixed scripts (e.g., technical documents with mathematical symbols), this strategy may be overly aggressive.
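Script Ambiguity: Characters shared across scripts, such as digits, punctuation, and whitespace, are assigned to the Common script rather than to any single script. They still count toward the total, which can skew the dominance ratio in short or punctuation-heavy text.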
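The mixed-script limitation is easy to reproduce. In this sketch, which again relies on a hypothetical hand-written map rather than the library's own, a legitimately quoted Russian word inside an English sentence is partially rewritten because the map is applied to every character regardless of context.

```python
# Hypothetical map of three Cyrillic letters to Latin look-alikes.
TOY_MAP = {"\u043e": "o", "\u0440": "p", "\u0435": "e"}

# English sentence quoting the Russian word for "sea" (U+043C U+043E U+0440 U+0435).
text = "The word \u043c\u043e\u0440\u0435 means sea."
normalized = text.translate(str.maketrans(TOY_MAP))

# The first letter has no entry in the map and stays Cyrillic, while the other
# three become Latin, so the quoted word is now a mixed-script string.
word_before = text.split()[2]
word_after = normalized.split()[2]
print([f"U+{ord(c):04X}" for c in word_before])
print([f"U+{ord(c):04X}" for c in word_after])
```

Whether the library's actual map behaves the same way for this word depends on its contents, so treat this as an illustration of the failure mode rather than a prediction of its output.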

Performance Characteristics:

Time Complexity: O(n) where n is the length of the text, making it suitable for large documents.
Memory Usage: Minimal memory overhead, primarily storing script counts and the normalization map.
Scalability: Processing cost grows linearly with input size, which makes the strategy a good fit for high-volume text processing applications.

Best Practices:

Text Validation: Check the warning logs to ensure the detected dominant script aligns with expectations.
Threshold Monitoring: Monitor cases where the dominant script comprises less than 75% of characters, as these may indicate edge cases or attacks.
Error Handling: Always handle cases where no normalization map is available for the detected script.
Testing: Test with various script combinations to understand the strategy’s behavior in your specific use case.
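Because the snippets in this section call `logging.warning`, which logs through the root logger, one way to monitor the threshold warning programmatically is to attach a handler to the root logger and collect the records. This is a minimal sketch: `WarningCollector` and `noisy_step` are hypothetical names, and `noisy_step` merely stands in for a call into the normalization code.

```python
import logging


class WarningCollector(logging.Handler):
    """Hypothetical helper that stores WARNING-and-above messages in a list."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def noisy_step() -> None:
    # Stand-in for a normalization call that logs through the root logger.
    logging.warning("Dominant script 'Latin' is below the 75% threshold")


collector = WarningCollector()
root = logging.getLogger()
root.addHandler(collector)
try:
    noisy_step()
finally:
    # Always detach the handler so later code is not affected.
    root.removeHandler(collector)

if collector.messages:
    print("Review needed:", collector.messages)
```

Collected messages can then be routed to whatever review or alerting process the application already uses, instead of relying on someone reading the log output.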

Security Considerations:

Attack Detection: The 75% threshold can help identify potential homoglyph attacks where attackers mix scripts to evade detection.
False Positives: Be aware that some legitimate multilingual content may trigger warnings.
Complementary Strategies: Consider combining this strategy with others for comprehensive homoglyph detection and normalization.