Dominant Script Strategy
========================

The Dominant Script Strategy is an efficient normalization approach that identifies the most frequently occurring Unicode script in a given text and applies normalization rules based on this dominant script. This strategy is particularly effective for texts predominantly written in a single script, such as Latin, Cyrillic, Arabic, or Greek.

Overview
--------

This strategy addresses the challenge of script-based homoglyph attacks by leveraging the fundamental principle that most legitimate text is predominantly written in a single Unicode script. By identifying the dominant script and applying script-specific normalization rules, the strategy can effectively restore homoglyphs to their canonical forms while maintaining the text's intended linguistic context.

The strategy is especially valuable for applications dealing with user-generated content, document processing, and security systems where homoglyph-based attacks might attempt to bypass filters or detection mechanisms.

Implementation Details
-----------------------

1. **Script Detection and Analysis**:
   The strategy begins by analyzing the Unicode script distribution across all characters in the input text. This analysis uses the ``unicodedataplus`` library to determine each character's script:

   .. code-block:: python

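      import logging
      from collections import Counter

      import unicodedataplus

      def detect_dominant_script(text: str) -> str:
          # Assumes non-empty text; the caller guards against empty input.
          # Count how many characters belong to each Unicode script.
          script_counts = Counter(unicodedataplus.script(char) for char in text)
          total_count = sum(script_counts.values())
          # Pick the script with the highest count.
          dominant_script = max(script_counts, key=script_counts.get)

          # Warn (but still proceed) when no script clearly dominates.
          if script_counts[dominant_script] / total_count < 0.75:
              logging.warning(
                  f"The dominant script '{dominant_script}' comprises less than 75% of the total character count. "
                  "This is unusual, as most texts predominantly consist of characters from a single script."
              )
          return dominant_script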

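   The algorithm counts the characters belonging to each Unicode script and selects the most frequent one. If the dominant script accounts for less than 75% of all characters, a warning is logged, since this may indicate either genuinely mixed-script text or an attack attempt; normalization still proceeds. Every character is counted, including whitespace, digits, and punctuation, which Unicode assigns to the ``Common`` script.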

2. **Normalization Map Generation**:
   Once the dominant script is identified, the strategy retrieves a script-specific normalization map that defines appropriate replacements for homoglyphs:

   .. code-block:: python

      dominant_script = detect_dominant_script(text)
      normalization_map = replacer.get_normalization_map_for_script_block_and_category(
          script=dominant_script, **kwargs
      )

   The normalization map is generated by the ``HomoglyphReplacer`` instance, which holds mappings from homoglyphs to their canonical forms for each script. Restricting the map to the dominant script keeps the replacements appropriate for that script.

3. **Character Translation and Replacement**:
   The final step applies the normalization map to the entire text using Python's efficient ``str.translate()`` method:

   .. code-block:: python

      import logging

      def apply_dominant_script_strategy(replacer, text: str, **kwargs) -> str:
          # Nothing to normalize: warn and return an empty string.
          if not text:
              logging.warning("Empty text provided for normalization")
              return ""

          if not replacer:
              raise ValueError("No replacer provided for normalization")

          # Build the homoglyph map for the most frequent script in the text.
          dominant_script = detect_dominant_script(text)
          normalization_map = replacer.get_normalization_map_for_script_block_and_category(
              script=dominant_script, **kwargs
          )

          # Without a map, leave the text untouched.
          if not normalization_map:
              logging.warning(f"No normalization map available for script '{dominant_script}'")
              return text

          # Replace every mapped character in a single pass.
          return text.translate(str.maketrans(normalization_map))

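   The function handles three edge cases explicitly: empty input returns an empty string and logs a warning, a missing replacer raises ``ValueError``, and a missing normalization map for the detected script logs a warning and returns the text unchanged.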
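
The translation step can be illustrated in isolation. The following is a minimal sketch, not the library's implementation: ``cyrillic_to_latin`` is a small, hand-written map that stands in for whatever the replacer would return, and the input uses escape sequences so the Cyrillic characters are unambiguous.

.. code-block:: python

   # Hypothetical hand-written map (an assumption for illustration only):
   # Cyrillic lookalike -> Latin letter.
   cyrillic_to_latin = {
       "\u0440": "p",  # CYRILLIC SMALL LETTER ER
       "\u0435": "e",  # CYRILLIC SMALL LETTER IE
       "\u043e": "o",  # CYRILLIC SMALL LETTER O
       "\u0443": "y",  # CYRILLIC SMALL LETTER U
   }

   # str.maketrans() converts the dictionary into a code-point lookup table.
   table = str.maketrans(cyrillic_to_latin)

   # Latin sentence with five Cyrillic homoglyphs spliced in.
   text = "Exam\u0440le t\u0435xt with s\u043eme homogl\u0443\u0440hs."
   normalized = text.translate(table)
   print(normalized)  # Example text with some homoglyphs.

When ``str.maketrans()`` receives a single dictionary, its keys must be single characters (or code points), which fits one-to-one homoglyph replacements well. Characters that are absent from the map pass through unchanged.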

Example Usage
-------------

The following example normalizes a mostly Latin sentence that contains a few Cyrillic homoglyphs using the Dominant Script Strategy:

.. code-block:: python

   from silverspeak.homoglyphs.normalization import apply_dominant_script_strategy
   from silverspeak.homoglyphs import HomoglyphReplacer

   # Text containing Cyrillic homoglyphs mixed with Latin
   text = "Examрle tеxt with sоme homoglурhs."  # Contains Cyrillic 'р', 'е', 'о', 'у'
   
   # Initialize the replacer
   replacer = HomoglyphReplacer()
   
   # Apply dominant script normalization
   normalized_text = apply_dominant_script_strategy(replacer, text)
   print(normalized_text)  # Output: "Example text with some homoglyphs."

This example shows how the strategy analyzes the script distribution and normalizes characters to match the dominant script (Latin in this case).
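
To make the script distribution of this example concrete, the sketch below recomputes it with the standard library only. ``approximate_script`` is a hypothetical stand-in for ``unicodedataplus.script()``: it derives a script label from the first word of each letter's Unicode name and labels everything else ``Common``, so it is an approximation rather than the library's logic.

.. code-block:: python

   import unicodedata
   from collections import Counter


   def approximate_script(char: str) -> str:
       """Rough stand-in for unicodedataplus.script() (an assumption)."""
       if not char.isalpha():
           # Whitespace, digits and punctuation are treated as "Common".
           return "Common"
       # Letters: first word of the Unicode name, e.g. "CYRILLIC SMALL LETTER ER".
       return unicodedata.name(char, "UNKNOWN").split()[0].title()


   def script_shares(text: str) -> dict:
       """Return each script's share of the characters, most frequent first."""
       counts = Counter(approximate_script(char) for char in text)
       total = sum(counts.values())
       return {script: count / total for script, count in counts.most_common()}


   # Same sentence as above, with the Cyrillic letters written as escapes.
   text = "Exam\u0440le t\u0435xt with s\u043eme homogl\u0443\u0440hs."
   shares = script_shares(text)
   dominant = max(shares, key=shares.get)

   for script, share in shares.items():
       print(f"{script}: {share:.2f}")
   print("below 75% threshold:", shares[dominant] < 0.75)

Under this approximation, 24 of the 34 characters are Latin (about 71%), 5 are Cyrillic, and 5 are spaces or punctuation. Because every character counts toward the total, the 75% warning may well be logged for this sentence even though Latin is clearly the dominant script.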

**Alternative Usage via normalize_text**:

.. code-block:: python

   from silverspeak.homoglyphs import normalize_text
   from silverspeak.homoglyphs.utils import NormalizationStrategies

   text = "Examрle tеxt with sоme homoglурhs."
   normalized_text = normalize_text(
       text, 
       strategy=NormalizationStrategies.DOMINANT_SCRIPT
   )
   print(normalized_text)

**Advanced Usage with Custom Parameters**:

.. code-block:: python

   # Preserve case during normalization
   normalized_text = apply_dominant_script_strategy(
       replacer, 
       text, 
       preserve_case=True
   )
   
   # Filter by specific Unicode category
   normalized_text = apply_dominant_script_strategy(
       replacer, 
       text, 
       category="Ll"  # Lowercase letters only
   )

Key Considerations
-------------------

**Strengths and Effectiveness:**

- **Computational Efficiency**: The strategy makes one pass over the text to count scripts and a second pass to replace characters.
- **Script Consistency**: Ensures that the entire text maintains consistency with the dominant script, preventing mixed-script confusion.
- **Robust Detection**: The 75% threshold flags texts in which no single script clearly dominates, while tolerating a moderate share of characters from other scripts. It only triggers a warning; normalization proceeds either way.

**Limitations and Use Cases:**

- **Single Script Assumption**: This strategy assumes that legitimate text is predominantly written in a single script. It may not be suitable for genuinely multilingual documents.
- **Mixed Script Handling**: For texts with intentionally mixed scripts (e.g., technical documents with mathematical symbols), this strategy may be overly aggressive.
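- **Script Ambiguity**: Characters shared across scripts, such as digits, punctuation, and combining marks, are assigned to the ``Common`` or ``Inherited`` scripts rather than to a specific one, which can skew the script counts and affect detection accuracy.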
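
The risk of overly aggressive replacement can be sketched with a Latin-dominant sentence that legitimately quotes a Russian word. The map ``lookalikes_to_latin`` is a hypothetical fragment written for this illustration; the maps produced by the library may differ.

.. code-block:: python

   # Hypothetical map fragment: Cyrillic letters that resemble Latin ones.
   lookalikes_to_latin = {
       "\u0435": "e",  # CYRILLIC SMALL LETTER IE
       "\u0440": "p",  # CYRILLIC SMALL LETTER ER
       "\u043e": "o",  # CYRILLIC SMALL LETTER O
       "\u0430": "a",  # CYRILLIC SMALL LETTER A
   }

   # The final word is the Russian word for "pepper", written in Cyrillic.
   text = "The Russian word for pepper is \u043f\u0435\u0440\u0435\u0446."
   normalized = text.translate(str.maketrans(lookalikes_to_latin))

   quoted_before = text.split()[-1].rstrip(".")
   quoted_after = normalized.split()[-1].rstrip(".")
   print(quoted_before, "->", quoted_after)

Only three of the five Cyrillic letters have an entry in the map, so the quoted word ends up as a mixture of Cyrillic and Latin letters. Legitimate foreign-language content is damaged rather than preserved, which is why this strategy is a poor fit for genuinely multilingual text.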

**Performance Characteristics:**

- **Time Complexity**: O(n) in the length of the text for the detection and translation passes, making the strategy suitable for large documents.
- **Memory Usage**: Minimal overhead beyond the input text: a counter of script frequencies and the normalization map.
- **Scalability**: Cost grows linearly with input size, so the strategy suits high-volume text processing applications.

**Best Practices:**

- **Text Validation**: Check the warning logs to ensure the detected dominant script aligns with expectations.
- **Threshold Monitoring**: Monitor cases where the dominant script comprises less than 75% of characters, as these may indicate edge cases or attacks.
- **Error Handling**: Always handle cases where no normalization map is available for the detected script.
- **Testing**: Test with various script combinations to understand the strategy's behavior in your specific use case.
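
Checking the warning logs can be automated. The sketch below assumes the warnings are emitted on the root logger through ``logging.warning()``, as in the code shown earlier; ``collect_root_warnings`` and ``fake_normalize`` are hypothetical names, and ``fake_normalize`` merely simulates a call that triggers the threshold warning.

.. code-block:: python

   import logging
   from contextlib import contextmanager


   class ListHandler(logging.Handler):
       """Logging handler that stores warning messages in a list."""

       def __init__(self):
           super().__init__(level=logging.WARNING)
           self.messages = []

       def emit(self, record):
           self.messages.append(record.getMessage())


   @contextmanager
   def collect_root_warnings():
       """Temporarily collect WARNING-and-above messages from the root logger."""
       handler = ListHandler()
       root = logging.getLogger()
       root.addHandler(handler)
       try:
           yield handler.messages
       finally:
           root.removeHandler(handler)


   def fake_normalize(text: str) -> str:
       # Hypothetical stand-in for a normalization call that logs the warning.
       logging.warning("The dominant script 'Latin' comprises less than 75% of the total character count.")
       return text


   with collect_root_warnings() as warnings_seen:
       result = fake_normalize("sample text")

   # Matching on message text is brittle; adjust it to the messages actually logged.
   needs_review = any("less than 75%" in message for message in warnings_seen)
   print(needs_review)  # True

Texts flagged this way can be routed to manual review or to a different normalization strategy instead of being silently accepted.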

**Security Considerations:**

- **Attack Detection**: The 75% threshold can help identify potential homoglyph attacks where attackers mix scripts to evade detection.
- **False Positives**: Be aware that some legitimate multilingual content may trigger warnings.
- **Complementary Strategies**: Consider combining this strategy with others for comprehensive homoglyph detection and normalization.