• Banno
    29.3k
    I notice you didn't wait for an answer and just went ahead and did it. So preoccupied with whether or not you could, you didn't stop for more than a moment to think if you should (to paraphrase Michael Crichton).Jamal
    It's true. The scientist in me got the better of my moral compass.

    So it's reading all your posts before it comes up with an answer each time?Jamal
    No, I built a JSON file, converted it to embeddings, then created a FAISS index from that... this took a while. The question is converted to an embedding vector, which is then used to quickly find the ten (up from 4) most similar posts for the LLM. Each post is a single 384-dimensional vector.

    Or so I'm told...

    The results are slowly improving. Still pretty chunky. Playing with the wording of the prompt has made some improvement.

    In my present immoral state, I'll tempt you to do something like this with the entire data file... a master philosophy forum bot...

    Happy to share what I've understood of the process.
  • Jamal
    11.3k
    In my present immoral state, I'll tempt you to do something like this with the entire data file... a master philosophy forum bot...Banno

    I vaguely recall we were warned about this around 1818. Something to do with an unconventional Swiss scientist.
  • Banno
    29.3k
    Yes. I rather think my part in this is Igor.

    Asked to summarise what it did, ChatGPT said:
    I helped you turn a complete export of your forum posts into a fully searchable, locally hosted system, culminating in a personal chatbot, BannoBot, capable of answering questions based on your own writing. First, we converted your raw text file into structured JSON, extracting discussions and comments, and ensured it was readable by Python. We then loaded this JSON into an SQLite database, allowing easy querying and verification. To make the content searchable via a web interface, we built a static HTML page that queries the JSON using JavaScript, serving it through a local HTTP server on your Mac, so you could open it in Safari without complex setup.
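
    A minimal sketch of the JSON-to-SQLite step described above, using only the Python standard library. The field names (`id`, `discussion`, `body`) and the sample records are assumptions made for illustration; the real export's structure is not shown in this thread.

```python
import json
import sqlite3

# Hypothetical sample records; the real export's fields are not shown in the thread.
SAMPLE_JSON = """
[
  {"id": 1, "discussion": "Thread A", "body": "first sample post"},
  {"id": 2, "discussion": "Thread A", "body": "second sample post"},
  {"id": 3, "discussion": "Thread B", "body": "third sample post"}
]
"""

def load_posts(conn, json_text):
    """Create a posts table and insert every record from the JSON text."""
    posts = json.loads(json_text)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, discussion TEXT, body TEXT)"
    )
    # Named placeholders are filled from each dict's keys.
    conn.executemany(
        "INSERT INTO posts (id, discussion, body) "
        "VALUES (:id, :discussion, :body)",
        posts,
    )
    conn.commit()
    return len(posts)

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
inserted = load_posts(conn, SAMPLE_JSON)

# A simple verification query: how many posts per discussion?
counts = dict(
    conn.execute(
        "SELECT discussion, COUNT(*) FROM posts GROUP BY discussion"
    ).fetchall()
)
```

    Swapping `":memory:"` for a file path would persist the database between runs.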

    Next, we preprocessed the posts by generating embeddings. Using the MiniLM-L6-v2 model, each post was converted into a 384-dimensional numerical vector capturing its semantic meaning. These vectors were stored locally as embeddings.npy. To enable fast similarity search, we built a FAISS index from the embeddings, allowing the bot to retrieve only the most relevant posts for a user query rather than scanning all 29,918 posts each time.
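
    The idea behind that retrieval step can be sketched without any external libraries. This is a toy stand-in, not the setup described above: a word-count vector replaces the MiniLM embedding, and a brute-force cosine search replaces the FAISS index. The function names and sample posts are invented for illustration.

```python
import math
from collections import Counter

def build_vocab(texts):
    """Map every distinct word in the corpus to a vector position."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def embed(text, vocab):
    """Toy embedding: L2-normalised word counts (a real model captures meaning)."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(text.lower().split()).items():
        if word in vocab:
            vec[vocab[word]] = float(count)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def top_k(query, posts, post_vectors, vocab, k=10):
    """Brute-force cosine search; a FAISS index would replace this loop."""
    q = embed(query, vocab)
    scored = [
        (sum(a * b for a, b in zip(q, v)), post)
        for v, post in zip(post_vectors, posts)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [post for _, post in scored[:k]]

# Hypothetical posts, used only to exercise the functions.
posts = [
    "truth is a property of statements",
    "the cat sat on the mat",
    "fruit cake goes well with tea",
]
vocab = build_vocab(posts)
post_vectors = [embed(p, vocab) for p in posts]  # computed once, reused per query

best = top_k("what is truth", posts, post_vectors, vocab, k=1)
```

    The post vectors are computed once up front, which is why each question only costs one embedding plus a search.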

    We then integrated the BannoBot script, which takes a user’s question, converts it into an embedding, searches the FAISS index for top-k relevant posts, and constructs a prompt including these excerpts. This prompt is passed to a local LLM (Orca-Mini), which generates a natural-language answer in the style and content of your posts. All processing—embedding, search, and LLM inference—occurs on your laptop, ensuring privacy and avoiding cloud APIs.
    Finally, we tested the system end-to-end: verifying the JSON, checking embeddings and index consistency, and running the bot interactively. Optional improvements include richer prompts and a GUI wrapper for easier interaction. The result is a fully local, private chatbot that reads your posts semantically and produces coherent answers, effectively turning your entire post archive into an intelligent, searchable personal assistant.
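
    The prompt-construction step mentioned above might look roughly like the sketch below. The instruction wording, the `max_chars` truncation and the function name are all assumptions; the actual BannoBot prompt is not quoted in this thread, and the call to the local LLM is left out.

```python
def build_prompt(question, excerpts, max_chars=500):
    """Assemble a prompt from the question and the retrieved post excerpts."""
    lines = [
        "Answer the question using only the excerpts below, "
        "in the style of their author.",
        "",
    ]
    for i, text in enumerate(excerpts, start=1):
        # Truncate long posts so the prompt fits a small model's context window.
        lines.append(f"[{i}] {text[:max_chars]}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)

prompt = build_prompt("What is truth?", ["first excerpt", "second excerpt"])
```

    Since rewording the prompt reportedly improved the results, the instruction line is the obvious thing to experiment with.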
  • Jamal
    11.3k
    Next, we preprocessed the posts by generating embeddings. Using the MiniLM-L6-v2 model, each post was converted into a 384-dimensional numerical vector capturing its semantic meaning. These vectors were stored locally as embeddings.npy. To enable fast similarity search, we built a FAISS index from the embeddings, allowing the bot to retrieve only the most relevant posts for a user query rather than scanning all 29,918 posts each time.

    We then integrated the BannoBot script, which takes a user’s question, converts it into an embedding, searches the FAISS index for top-k relevant posts, and constructs a prompt including these excerpts. This prompt is passed to a local LLM (Orca-Mini), which generates a natural-language answer in the style and content of your posts. All processing—embedding, search, and LLM inference—occurs on your laptop, ensuring privacy and avoiding cloud APIs.

    Cool. I wonder though...

    I'll tempt you to do something like this with the entire data file... a master philosophy forum bot...Banno

    Like an amalgam of everyone on TPF? Tempting though it is, my instinct is that the wider the range of data, the more it would approach a regular LLM like ChatGPT and lose its own point of view.
  • Banno
    29.3k
    Like an amalgam of everyone on TPF?Jamal

    "My name is Legion, for we are many..."

    Given a well-constructed prompt, would it be able to present a range of views?

    It would still be very small compared with the commercial sites.
  • Hanover
    14.8k
    I have asked ChatGPT to mimic me based upon its interactions with me, but it does a poor job. I'm much funnier. If @Banno wants a convincing mimic, no need to turn towards AI. I could do a fine job, but I'd quickly be outed. Again, I'm much funnier.
  • Banno
    29.3k
    What is frightening is that I, with ancient coding skills and precious little understanding, was able to build the bot sitting in my lounge chair over a cup of tea and some nice fruit cake on a lazy Sunday morning.

    Something far-reaching has happened with AI.
  • Jamal
    11.3k


    All very interesting but we've gone too far off-topic. Since this is an important topic I'd like to keep it pure.
  • Hanover
    14.8k
    Like an amalgam of everyone on TPF? Tempting though it is, my instinct is that the wider the range of data, the more it would approach a regular LLM like ChatGPT and lose its own point of view.Jamal

    You believe the posters here represent a cross section of the public at large?
  • Jamal
    11.3k
    Approach, I said.

    New topic if you like.
  • Hanover
    14.8k
    That post was a cross-post, posted before you chastised me. :up:
  • Banno
    29.3k
    Fair.

    Anyone interested in continuing, PM me.

    In any case it seems that the result will not get past a few simple lines... Orca Mini can't cut it. Looking at alternatives. And an offer to buy me out from North Korea.
  • Jamal
    11.3k
    Google has already begun to index tpfarchive.com, no doubt because we've linked to it here.

    I forgot to remove the short stories. On the one hand they should be archived, on the other hand some of us (like me and @hypericin) don't want our stories to be publicly available, i.e., published, on the web.
  • Jamal
    11.3k
    The above issue alerted me to another one: the possibility of SEO penalization for duplicate content. So I've now blocked crawlers with meta tags, a robots.txt file, and also in the web server config, thus preventing search engines from indexing any more pages on tpfarchive.com. I'll remove those blocks when we close the Plush site.
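
    For the robots.txt part of that, a blanket block can be checked locally with Python's standard `urllib.robotparser`. The file content and the page URL below are assumptions for illustration; the archive's actual robots.txt is not quoted in this thread.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; the site's real file is not shown in the thread.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

def can_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical page path on the archive site.
blocked = not can_crawl(BLOCK_ALL, "Googlebot", "https://tpfarchive.com/some/page")
```

    The page-level counterpart is usually a `<meta name="robots" content="noindex">` tag in each page's head.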
  • hypericin
    2k
    Please don't remove them, @Jamal! I like that there is a permanent repository for the stories on the web. I might lose my copies someday, then they will be gone forever. I don't know if you still care, but I actually prefer that they are publicly available at this point. Easier to share with people.
  • Jamal
    11.3k


    OK, I'll leave them. I've found a decent compromise, which is that they won't be indexed by search engines but anyone can visit those pages, given the URLs (or find them from the archive's search page or whatever).
  • hypericin
    2k
    Great, nice compromise.
  • Outlander
    2.9k
    OK, I'll leave them. I've found a decent compromise, which is that they won't be indexed by search engines but anyone can visit those pages, given the URLs (or find them from the archive's search page or whatever).Jamal

    Or, you can just copy and paste them (along with relevant information such as the particular contest name and type, date of publication, user vote data, and author, if available) to a new, locked read-only sub forum on the upcoming new site? Unless the comments or reviews are of such dire importance. :snicker:

    Or would that result in some sort of unfortunate gray area as far as the legal standards and obligations you're seeking to meet and fulfill are concerned? :chin:

    If so, perhaps there can be a further compromise such as a single pinned thread in some sub forum that links every single short story/poem/literature from every competition that links to its respective content on the archive site? Or would even that be disallowed?
  • Jamal
    11.3k


    I'm not migrating any posts to the new site, so the current compromise is fine. As far as I recall it's only me and @hypericin who ever cared anyway.
  • Jamal
    11.3k
    The one big improvement I could make to the archive is to include the categories. For some reason Plush gave me an export without categories so I just went ahead and built the archive without them. Now I've got them so I could potentially incorporate them somehow. Maybe just in the search and your-posts pages.
  • Baden
    16.7k
    The one big improvement I could make to the archive is to include the categories.Jamal

    Please do. :up:
  • Outlander
    2.9k
    The one big improvement I could make to the archive is to include the categories.Jamal

    I'll give you one hint.

    To retrieve the category automatically from the HTML content of the topic URL, use this Regular Expression (which I'm sure Perl supports):

    <li class="(?:.+?) Active"><strong><a href="(.+?)">(.+?)</a></strong></li>
    


    Item #1 will of course contain the fully matched string, Item #2 will contain the URL structure (e.g. "/categories/1/general-philosophy") and the magic Item #3 will contain the Category Name (e.g. "General Philosophy").
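
    A quick way to try that expression is in Python rather than Perl. The sample markup below is invented for the test (the "Item" class name and the second category are assumptions), so the real page HTML may differ. Note that Python numbers groups from zero, so Items #1, #2 and #3 above correspond to `group(0)`, `group(1)` and `group(2)`.

```python
import re

CATEGORY_RE = re.compile(
    r'<li class="(?:.+?) Active"><strong><a href="(.+?)">(.+?)</a></strong></li>'
)

# Hypothetical markup; only the active item is wrapped in <strong>.
SAMPLE_HTML = """\
<ul>
<li class="Item"><a href="/categories/2/ethics">Ethics</a></li>
<li class="Item Active"><strong><a href="/categories/1/general-philosophy">General Philosophy</a></strong></li>
</ul>
"""

def extract_category(html):
    """Return (url, name) for the active category, or None if there is none."""
    match = CATEGORY_RE.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2)

category = extract_category(SAMPLE_HTML)
```

    Because `.` does not match a newline by default, the pattern only works when the whole list item sits on one line, as it does in this sample.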

    Now all you have to do is fetch and retrieve 13,000 webpages and sort through them. Not as daunting as it might sound, but you should know that. :smile:

    (Unless of course the "dump" or rawest form of database storage you can access has some field or notation for category, in which case it'd be even easier!)
  • Jamal
    11.3k
    (Unless of course the "dump" or rawest form of database storage you can access has some field or notation for category, in which case it'd be even easier!)Outlander

    Yep. I've done it already. :up:

    (not live yet)
  • Jamal
    11.3k


    Done: categories can now be selected on the search page. Plus you can run a search without entering a search term to get all the discussions in a category. If it doesn't work maybe you need to clear your cached files.
  • Baden
    16.7k


    Nice. :cool: :up:
  • Leontiskos
    5.5k
    If you're interested, the main software requirements to accommodate the new laws are more configuration, crucially including the configuration of the sign-up form, and more moderation tools, crucially including the flagging and moderation of direct messages.Jamal

    Interesting. That's the sort of thing I was wondering about. I know it can also get tricky running an international website.

    I notice you didn't wait for an answer and just went ahead and did it. So preoccupied with whether or not you could, you didn't stop for more than a moment to think if you should (to paraphrase Michael Crichton).Jamal

    What you point to here is significant beyond belief.
  • Leontiskos
    5.5k
    When the sidebar is collapsed it's pretty distraction-free, no?Jamal

    Yes, I think so. That ability to collapse the sidebar is what I was thinking of. :up:Leontiskos

    Just to note, I've learned that there is an undisclosed shortcut to hide/show the Discourse sidebar ('=').
  • Banno
    29.3k
    Dare I ask, is there to be a treatment for private messages?

    There's some good stuff in my inbox, from various members. It runs to twelve pages.
