    Show HN: I modeled the Voynich Manuscript with SBERT to test for structure

    By TechAiVerse | May 18, 2025

    📜 Voynich Manuscript Structural Analysis

    🔍 Overview

    This started as a personal challenge to figure out what modern NLP could tell us about the Voynich Manuscript — without falling into translation speculation or pattern hallucination. I’m not a linguist or cryptographer. I just wanted to see if something as strange as Voynichese would hold up under real language modeling: clustering, POS inference, Markov transitions, and section-specific patterns.

    Spoiler: it kinda did.

    This repo walks through everything — from suffix stripping to SBERT embeddings to building a lexicon hypothesis. No magic, no GPT guessing. Just a skeptical test of whether the manuscript has structure that behaves like language, even if we don’t know what it’s saying.


    🧠 Why This Matters

    The Voynich Manuscript remains undeciphered, with no agreed linguistic or cryptographic solution. Traditional analyses often fall into two camps: statistical entropy checks or wild guesswork. This project offers a middle path: using computational linguistics to assess whether the manuscript shows structured, language-like behavior.


    📁 Project Structure

    /data/
      AB.docx                         # Full transliteration with folio/line tags
      voynichese/                     # Root word .txt files
      stripped_cluster_lookup.json    # Cluster ID per stripped root
      unique_stripped_words.json      # All stripped root forms
      voynich_line_clusters.csv       # Cluster sequences per line
    
    /scripts/
      cluster_roots.py                # SBERT clustering + suffix stripping
      map_lines_to_clusters.py        # Maps manuscript lines to cluster IDs
      pos_model.py                    # Infers grammatical roles from cluster behavior
      transition_matrix.py            # Builds and visualizes cluster transitions
      lexicon_builder.py              # Creates a candidate lexicon by section and role
      cluster_language_similarity.py  # (Optional) Compares clusters to real-world languages
    
    /results/
      Figure_1.png                    # SBERT clusters (PCA reduced)
      transition_matrix_heatmap.png   # Markov transition matrix
      cluster_role_summary.csv
      cluster_transition_matrix.csv
      lexicon_candidates.csv
    

    ✅ Key Contributions

    • Clustering of stripped root words using multilingual SBERT
    • Identification of function-word-like vs. content-word-like clusters
    • Markov-style transition modeling of cluster sequences
    • Folio-based syntactic structure mapping (Botanical, Biological, etc.)
    • Generation of a data-driven lexicon hypothesis table
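The Markov-style transition modeling listed above can be illustrated with a minimal sketch. It assumes each manuscript line has already been mapped to a sequence of integer cluster IDs; the function name `transition_matrix` and the toy sequences are hypothetical and are not taken from `transition_matrix.py`.

```python
from collections import Counter, defaultdict

def transition_matrix(lines):
    """Row-normalised first-order transition probabilities between cluster IDs.

    `lines` is a list of cluster-ID sequences, one per manuscript line.
    Transitions are counted within a line only, never across line breaks
    (an assumption; the repo may handle line boundaries differently).
    """
    counts = defaultdict(Counter)
    for seq in lines:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    matrix = {}
    for cur, row in counts.items():
        total = sum(row.values())
        matrix[cur] = {nxt: n / total for nxt, n in row.items()}
    return matrix

# Toy cluster sequences, made up for illustration (not real manuscript data).
toy_lines = [[8, 3, 3, 5], [8, 3, 5], [8, 5, 3]]
probs = transition_matrix(toy_lines)
```

Each row of `probs` sums to 1, so a row concentrated on a few successors indicates a constrained transition, while a near-uniform row indicates little structure.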

    🔧 Preprocessing Choices

    One of the most important assumptions I made was about how to handle the Voynich words before clustering. Specifically: I stripped a set of recurring suffix-like endings from each word — things like aiin, dy, chy, and similar variants. The goal was to isolate what looked like root forms that repeated with variation, under the assumption that these suffixes might be:

    • Phonetic padding
    • Grammatical particles
    • Chant-like or mnemonic repetition
    • Or… just noise

    This definitely improved the clustering behavior — similar stems grouped more tightly, and the transition matrix showed cleaner structural patterns. But it’s also a strong preprocessing decision that may have:

    • Removed actual morphological information
    • Disguised meaningful inflectional variants
    • Introduced a bias toward function over content

    So it’s not neutral — it helped, but it also shaped the results.
    If someone wants to fork this repo and re-run the pipeline without suffix stripping — or treat suffixes as their own token class — I’d be genuinely interested in the comparison.
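The stripping step described in this section might look roughly like the sketch below. The suffix list contains only the three endings named in the text; the "similar variants" are not specified, so they are left out. The function name `strip_suffix`, the `min_root` guard, and the example tokens are assumptions for illustration, not the code in `cluster_roots.py`.

```python
# Only the endings named in the post; the full list used in the repo is unknown.
SUFFIXES = ("aiin", "chy", "dy")

def strip_suffix(word, suffixes=SUFFIXES, min_root=1):
    """Split `word` into (root, suffix), trying the longest suffix first.

    A suffix is removed only if at least `min_root` characters remain,
    so a token that consists solely of a suffix is left untouched.
    """
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_root:
            return word[: -len(suf)], suf
    return word, ""

# Illustrative transliterated tokens (hypothetical input).
examples = [strip_suffix(w) for w in ("qokaiin", "chedy", "shol", "dy")]
```

Returning the removed suffix alongside the root keeps the information available, which makes it straightforward to re-run the pipeline with suffixes treated as their own token class instead of being discarded.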


    📈 Key Findings

    • Cluster 8 exhibits high frequency, low diversity, and frequent line-starts — likely a function word group
    • Cluster 3 has high diversity and flexible positioning — likely a root content class
    • Transition matrix shows strong internal structure, far from random
    • Cluster usage and POS patterns differ by manuscript section (e.g., Biological vs Botanical)
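The three signals used to separate function-like from content-like clusters (frequency, diversity, line-start behavior) can be computed with a short sketch. Here diversity is taken to be the ratio of distinct roots to total tokens in a cluster, which is an assumption; the post does not define its measure. The name `cluster_profile` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_profile(lines):
    """Per-cluster frequency, root diversity, and line-start rate.

    `lines` is a list of lines, each a list of (root, cluster_id) pairs.
    """
    freq = Counter()
    roots = defaultdict(set)
    starts = Counter()
    for line in lines:
        for pos, (root, cid) in enumerate(line):
            freq[cid] += 1
            roots[cid].add(root)
            if pos == 0:
                starts[cid] += 1
    return {
        cid: {
            "frequency": freq[cid],
            # Distinct roots per token: low values mean few forms repeated often.
            "diversity": len(roots[cid]) / freq[cid],
            "line_start_rate": starts[cid] / freq[cid],
        }
        for cid in freq
    }

# Toy data, made up for illustration (not real manuscript data).
toy = [
    [("ol", 8), ("chek", 3), ("shod", 3)],
    [("ol", 8), ("kar", 3)],
    [("ol", 8), ("chek", 3), ("ol", 8)],
]
profile = cluster_profile(toy)
```

In this toy input, cluster 8 repeats a single root and usually opens the line, while cluster 3 has several roots and never opens a line, mirroring the contrast described in the findings.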

    🧬 Hypothesis

    The manuscript encodes a structured constructed or mnemonic language using syllabic padding and positional repetition. It exhibits syntax, function/content separation, and section-aware linguistic shifts — even in the absence of direct translation.
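One way to put a number on "section-aware shifts" is to compare the cluster distributions of two sections. The sketch below uses Jensen-Shannon divergence, which is a technique chosen here for illustration and is not stated to be the measure used in the repo; the function names and section data are hypothetical.

```python
import math
from collections import Counter

def distribution(cluster_ids):
    """Relative frequency of each cluster ID in a section."""
    counts = Counter(cluster_ids)
    total = sum(counts.values())
    return {cid: n / total for cid, n in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of `a` from the mixture `m`.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy section data, made up for illustration (not real manuscript data).
botanical = distribution([8, 3, 3, 5, 8, 3])
biological = distribution([8, 5, 5, 5, 8, 2])
shift = js_divergence(botanical, biological)
```

A value near 0 would mean the two sections use clusters in nearly the same proportions; larger values indicate a stronger shift. Comparing the result against the divergence between random splits of the same section would help judge whether a shift is meaningful.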


    ▶️ How to Reproduce

    # 1. Install dependencies
    pip install -r requirements.txt
    
    # 2. Run each stage of the pipeline
    python scripts/cluster_roots.py
    python scripts/map_lines_to_clusters.py
    python scripts/pos_model.py
    python scripts/transition_matrix.py
    python scripts/lexicon_builder.py
    

    📊 Example Visualizations

    📌 Figure 1: SBERT cluster embeddings (PCA-reduced)

    📌 Figure 2: Transition Matrix Heatmap


    📌 Limitations

    • Cluster-to-word mappings are indirect — frequency estimates may overlap
    • Suffix stripping is heuristic and may remove meaningful endings
    • No semantic translation attempted — only structural modeling

    ✍️ Author's Note

    This project was built as a way to learn — about AI, NLP, and how far structured analysis can get you without assuming what you’re looking at. I’m not here to crack the Voynich. But I do believe that modeling its structure with modern tools is a better path than either wishful translation or academic dismissal.

    So if you’re here for a Rosetta Stone, you’re out of luck.

    If you’re here to model a language that may not want to be modeled — welcome.


    🤝 Contributions Welcome

    This project is open to extensions, critiques, and collaboration — especially from linguists, cryptographers, conlang enthusiasts, and computational language researchers.
