Dominant Script Strategy
========================

The Dominant Script Strategy is an efficient normalization approach that identifies the most frequently occurring Unicode script in a given text and applies normalization rules based on this dominant script. It is particularly effective for texts written predominantly in a single script, such as Latin, Cyrillic, Arabic, or Greek.

Overview
--------

This strategy addresses script-based homoglyph attacks by relying on the observation that most legitimate text is written predominantly in a single Unicode script. By identifying the dominant script and applying script-specific normalization rules, the strategy can restore homoglyphs to their canonical forms while preserving the text's intended linguistic context.

The strategy is especially valuable for applications dealing with user-generated content, document processing, and security systems, where homoglyph-based attacks might attempt to bypass filters or detection mechanisms.

Implementation Details
-----------------------

1. **Script Detection and Analysis**: The strategy begins by analyzing the Unicode script distribution across all characters in the input text. This analysis uses the ``unicodedataplus`` library to determine each character's script:

   .. code-block:: python

      import logging
      from collections import Counter

      import unicodedataplus

      def detect_dominant_script(text: str) -> str:
          # Count every character by script, including whitespace and punctuation.
          # Callers must pass non-empty text: max() raises ValueError on an empty Counter.
          script_counts = Counter(unicodedataplus.script(char) for char in text)
          total_count = sum(script_counts.values())
          dominant_script = max(script_counts.keys(), key=lambda k: script_counts[k])
          # Warn when no single script clearly dominates the text.
          if script_counts[dominant_script] / total_count < 0.75:
              logging.warning(
                  f"The dominant script '{dominant_script}' comprises less than 75% of the total character count. "
                  "This is unusual, as most texts predominantly consist of characters from a single script."
              )
          return dominant_script

   The algorithm calculates the frequency of each Unicode script and identifies the most common one.
   If the dominant script constitutes less than 75% of the total characters, a warning is logged, as this may indicate either mixed-script text or a potential attack attempt.

2. **Normalization Map Generation**: Once the dominant script is identified, the strategy retrieves a script-specific normalization map that defines appropriate replacements for homoglyphs:

   .. code-block:: python

      dominant_script = detect_dominant_script(text)
      normalization_map = replacer.get_normalization_map_for_script_block_and_category(
          script=dominant_script, **kwargs
      )

   The normalization map is generated by the ``HomoglyphReplacer`` instance, which contains mappings of homoglyphs to their canonical forms within specific scripts. This ensures that replacements are contextually appropriate for the identified script.

3. **Character Translation and Replacement**: The final step applies the normalization map to the entire text using Python's ``str.translate()`` method:

   .. code-block:: python

      def apply_dominant_script_strategy(replacer, text: str, **kwargs) -> str:
          # Nothing to normalize: log and return an empty string.
          if not text:
              logging.warning("Empty text provided for normalization")
              return ""
          if not replacer:
              raise ValueError("No replacer provided for normalization")
          dominant_script = detect_dominant_script(text)
          normalization_map = replacer.get_normalization_map_for_script_block_and_category(
              script=dominant_script, **kwargs
          )
          # Without a map for this script, return the text unchanged.
          if not normalization_map:
              logging.warning(f"No normalization map available for script '{dominant_script}'")
              return text
          return text.translate(str.maketrans(normalization_map))

   The function guards against three edge cases: empty input (a warning is logged and an empty string is returned), a missing replacer (``ValueError`` is raised), and a missing normalization map for the detected script (a warning is logged and the text is returned unchanged).

Example Usage
-------------

The following example demonstrates how to normalize text using the Dominant Script Strategy. This approach works best for texts that are written predominantly in a single script:

.. code-block:: python

   from silverspeak.homoglyphs.normalization import apply_dominant_script_strategy
   from silverspeak.homoglyphs import HomoglyphReplacer

   # Text containing Cyrillic homoglyphs mixed with Latin
   text = "Examрle tеxt with sоme homoglурhs."  # Contains Cyrillic 'р', 'е', 'о', 'у'

   # Initialize the replacer
   replacer = HomoglyphReplacer()

   # Apply dominant script normalization
   normalized_text = apply_dominant_script_strategy(replacer, text)
   print(normalized_text)
   # Output: "Example text with some homoglyphs."

This example shows how the strategy analyzes the script distribution and normalizes characters to match the dominant script (Latin in this case).

**Alternative Usage via normalize_text**:

.. code-block:: python

   from silverspeak.homoglyphs import normalize_text
   from silverspeak.homoglyphs.utils import NormalizationStrategies

   text = "Examрle tеxt with sоme homoglурhs."
   normalized_text = normalize_text(
       text,
       strategy=NormalizationStrategies.DOMINANT_SCRIPT
   )
   print(normalized_text)

**Advanced Usage with Custom Parameters**:

.. code-block:: python

   # Reuses `replacer` and `text` from the first example.

   # Preserve case during normalization
   normalized_text = apply_dominant_script_strategy(
       replacer,
       text,
       preserve_case=True
   )

   # Filter by specific Unicode category
   normalized_text = apply_dominant_script_strategy(
       replacer,
       text,
       category="Ll"  # Lowercase letters only
   )

Key Considerations
-------------------

**Strengths and Effectiveness:**

- **Computational Efficiency**: This strategy is highly efficient, requiring only a single pass through the text for script analysis and another for character replacement.
- **Script Consistency**: Ensures that the entire text maintains consistency with the dominant script, preventing mixed-script confusion.
- **Robust Detection**: The 75% threshold gives a simple signal of script dominance while tolerating a reasonable share of characters from other scripts.

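
To make the 75% threshold concrete, the sketch below computes per-script shares for the example sentence used earlier. It is an approximation, not the library's implementation: ``rough_script`` and ``script_shares`` are hypothetical helpers, and the script is guessed from the first word of each character's Unicode name using the standard ``unicodedata`` module rather than ``unicodedataplus``. Non-letters are lumped together as ``Common``.

.. code-block:: python

   import unicodedata
   from collections import Counter

   def rough_script(char: str) -> str:
       """Approximate a character's script (assumption: first word of its Unicode name)."""
       if not char.isalpha():
           return "Common"  # spaces, digits, punctuation
       name = unicodedata.name(char, "")
       return name.split(" ")[0].capitalize() if name else "Unknown"

   def script_shares(text: str) -> dict:
       """Return the fraction of characters that falls into each approximate script."""
       counts = Counter(rough_script(c) for c in text)
       total = sum(counts.values())
       return {script: n / total for script, n in counts.items()}

   # The escapes spell the Cyrillic letters er, ie, o, and u used as homoglyphs.
   text = "Exam\u0440le t\u0435xt with s\u043eme homogl\u0443\u0440hs."
   shares = script_shares(text)
   dominant = max(shares, key=shares.get)
   below_threshold = shares[dominant] < 0.75
   print(dominant, round(shares[dominant], 2), below_threshold)

Under this approximation, Latin letters account for 24 of the 34 characters (about 71%), with 5 Cyrillic letters and 5 spaces or punctuation marks. Because every character is counted, whitespace and punctuation lower the dominant script's share, so even a lightly attacked sentence can fall below 75%.
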

**Limitations and Use Cases:**

- **Single Script Assumption**: This strategy assumes that legitimate text is written predominantly in a single script. It may not be suitable for genuinely multilingual documents.
- **Mixed Script Handling**: For texts with intentionally mixed scripts (e.g., technical documents with mathematical symbols), this strategy may be overly aggressive.
- **Script Ambiguity**: Some characters belong to multiple scripts or have ambiguous script assignments, which may affect detection accuracy.

**Performance Characteristics:**

- **Time Complexity**: O(n), where n is the length of the text, making it suitable for large documents.
- **Memory Usage**: Minimal memory overhead, primarily for storing script counts and the normalization map.
- **Scalability**: The linear cost per text makes the strategy suitable for high-volume text processing.

**Best Practices:**

- **Text Validation**: Check the warning logs to ensure the detected dominant script aligns with expectations.
- **Threshold Monitoring**: Monitor cases where the dominant script comprises less than 75% of characters, as these may indicate edge cases or attacks.
- **Error Handling**: Always handle cases where no normalization map is available for the detected script. In that case the text is returned unchanged and only a warning is logged.
- **Testing**: Test with various script combinations to understand the strategy's behavior in your specific use case.

**Security Considerations:**

- **Attack Detection**: The 75% threshold can help identify potential homoglyph attacks where attackers mix scripts to evade detection.
- **False Positives**: Be aware that some legitimate multilingual content may trigger warnings.
- **Complementary Strategies**: Consider combining this strategy with others for comprehensive homoglyph detection and normalization.

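
Because the threshold warning is emitted through ``logging.warning``, it goes to the root logger, so one way to monitor it programmatically is to attach a temporary handler around the normalization call. The sketch below is a minimal illustration using only the standard library. ``collect_root_warnings`` and ``fake_normalize`` are hypothetical names, and ``fake_normalize`` merely stands in for a real call such as ``apply_dominant_script_strategy``. Matching on the substring ``"less than 75%"`` assumes the warning wording shown in the implementation above.

.. code-block:: python

   import logging
   from contextlib import contextmanager

   class _ListHandler(logging.Handler):
       """Collect formatted messages of WARNING level and above in a list."""

       def __init__(self):
           super().__init__(level=logging.WARNING)
           self.messages = []

       def emit(self, record):
           self.messages.append(record.getMessage())

   @contextmanager
   def collect_root_warnings():
       """Temporarily capture warnings sent to the root logger."""
       handler = _ListHandler()
       root = logging.getLogger()
       root.addHandler(handler)
       try:
           yield handler.messages
       finally:
           root.removeHandler(handler)

   def fake_normalize(text: str) -> str:
       # Hypothetical stand-in for the real strategy: emits the same kind of warning.
       logging.warning(
           "The dominant script 'Latin' comprises less than 75% of the total character count."
       )
       return text

   with collect_root_warnings() as warnings_seen:
       result = fake_normalize("sample")

   # Flag the input for manual review or for a complementary strategy.
   needs_review = any("less than 75%" in message for message in warnings_seen)
   print(needs_review)

A flag like ``needs_review`` could then route the text to a stricter check or to a different normalization strategy, in line with the recommendation to combine strategies.
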