Custom-trained GPT-3 / GPT-4 – Vector database with vector search

Say you have a bunch of company-specific documents, pages, or PDFs that you want a user to be able to query with GPT-3.

Without actually fine-tuning GPT-3 on them, you can get a pretty similar effect by using embeddings.

Note: Code is available on github: https://github.com/johnflux/gpt3_search

Here’s how:

  1. Ahead of time, generate an embedding vector for each of your company-specific documents, pages, or PDFs. You can even break up large documents and generate an embedding vector for each chunk individually.

    For example, for a medical system:

    Gastroesophageal reflux disease (GERD) occurs when stomach acid repeatedly flows back into the tube connecting your mouth and stomach (esophagus). This backwash (acid reflux) can irritate the lining of your esophagus.
    Many people experience acid reflux from time to time. However, when acid reflux happens repeatedly over time, it can cause GERD.
    Most people are able to manage the discomfort of GERD with lifestyle changes and medications. And though it’s uncommon, some may need surgery to ease symptoms.

    [0.22, 0.43, 0.21, 0.54, 0.32……]
  2. When a user asks a question, generate an embedding vector for that too:
    “I’m getting a lot of heartburn.”
    or
    “I’m feeling a burning feeling behind my breast bone”
    or
    “I keep waking up with a bitter tasting liquid in my mouth”

    [0.25, 0.38, 0.24, 0.55, 0.31……]
  3. Compare the query vector against all the document vectors and find which document vector is the closest (cosine similarity or Manhattan distance is fine). Show that document to the user:

    User: I'm getting a lot of heartburn.
    Document: Gastroesophageal reflux disease (GERD) occurs when stomach acid repeatedly flows back into the tube connecting your mouth and stomach (esophagus). This backwash (acid reflux) can irritate the lining of your esophagus.
    Many people experience acid reflux from time to time. However, when acid reflux happens repeatedly over time, it can cause GERD.
    Most people are able to manage the discomfort of GERD with lifestyle changes and medications. And though it's uncommon, some may need surgery to ease symptoms.


    Optionally, pass that document and the query to GPT-3, and ask it to reword the document to fit the question.
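Step 3 can be tried out on toy data before calling any API. The sketch below uses the first five components of the two example vectors above as stand-ins (real embeddings have many more dimensions), plus an invented second document so there is something to compare against. The function names `cosineSimilarity`, `manhattanDistance`, and `closestDocument` are illustrative, not taken from the repository.

```javascript
// Toy 5-dimensional vectors: the truncated examples from steps 1 and 2.
// Real embeddings are much longer; these are only for illustration.
const documents = [
  { name: 'gerd', embedding: [0.22, 0.43, 0.21, 0.54, 0.32] },
  // Hypothetical second document, invented for comparison.
  { name: 'unrelated', embedding: [0.9, 0.05, 0.7, 0.01, 0.02] },
];
const query = [0.25, 0.38, 0.24, 0.55, 0.31];

// Cosine similarity: higher means closer.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Manhattan (L1) distance: lower means closer.
function manhattanDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += Math.abs(a[i] - b[i]);
  }
  return sum;
}

// Return the document with the highest cosine similarity to the query.
function closestDocument(docs, queryEmbedding) {
  let best = null;
  for (const doc of docs) {
    const score = cosineSimilarity(doc.embedding, queryEmbedding);
    if (best === null || score > best.score) {
      best = { doc, score };
    }
  }
  return best;
}

const best = closestDocument(documents, query);
console.log(best.doc.name, best.score.toFixed(3));
```

Note the opposite directions: with cosine similarity you pick the largest value, with Manhattan distance the smallest.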

Implementation code


Generate embeddings

Here’s an example in Node.js:

const { Configuration, OpenAIApi } = require('openai');
const fs = require('fs');

const configuration = new Configuration({
  apiKey: 'YOUR-API-KEY', // replace with your own key; never publish a real one
});
const openai = new OpenAIApi(configuration);

/** Generates an embedding for each text file given on the command line.
 *  Usage:
 *   node generate_embeddings_from_txt_files.js *.txt > embeddings.json
 *   node embeddings_search.js I am getting a lot of heartburn.
 */

const filenames = process.argv.slice(2);

if (filenames.length === 0) {
  // Print to stderr so the usage message never ends up in the redirected JSON output
  console.error('Usage: node generate_embeddings_from_txt_files.js *.txt > embeddings.json');
  process.exit(1);
}

async function run() {
  const embeddings = [];
  for (const filename of filenames) {
    const input = fs.readFileSync(filename, 'utf8');

    try {
      const response = await openai.createEmbedding({ input, model: 'text-embedding-ada-002' });
      // Store the filename with the embedding so search results can be traced back to a file
      const output = { filename, embedding: response.data.data[0].embedding };
      embeddings.push(output);
      console.error('Success:', filename);
    } catch (e) {
      console.error('Failed:', filename, e.message);
    }
  }
  console.log(JSON.stringify(embeddings, null, 2));
}

run();

This outputs a file like:

[
{"filename": "gerd.txt", "embedding": [-0.021665672, 0.00097308296, 0.027932819, -0.027959095,....<snipped>]},
....
]
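As step 1 notes, large documents can be split up and each chunk embedded separately. One simple way to do that is sketched below, assuming that splitting on blank lines and capping each chunk by character count is good enough for your content. The helper name `chunkText` and the `maxChars` default are hypothetical; each returned chunk would be passed as `input` to `createEmbedding` in place of the whole file.

```javascript
// Split text into chunks of at most maxChars characters, breaking on
// paragraph boundaries (blank lines) where possible.
function chunkText(text, maxChars = 2000) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks = [];
  let current = '';
  for (const paragraph of paragraphs) {
    // A single paragraph longer than maxChars is hard-split into pieces.
    if (paragraph.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = '';
      }
      for (let i = 0; i < paragraph.length; i += maxChars) {
        chunks.push(paragraph.slice(i, i + maxChars));
      }
      continue;
    }
    // Otherwise, add the paragraph to the current chunk if it still fits.
    const candidate = current ? current + '\n\n' + paragraph : paragraph;
    if (candidate.length > maxChars) {
      chunks.push(current);
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) {
    chunks.push(current);
  }
  return chunks;
}
```

If you chunk documents this way, it helps to store a chunk index next to the filename in each output record, so a search hit can be traced back to the right part of the file.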

Do search

const { Configuration, OpenAIApi } = require('openai');

const configuration = new Configuration({
  apiKey: 'YOUR-API-KEY',
});
const openai = new OpenAIApi(configuration);

const jsonfilename = './embeddings.json';
const searchTerm = process.argv.slice(2).join(' ');

if (!searchTerm) {
  console.log('Usage: node embeddings_search.js diabetes referral guide');
  process.exit(1);
}

const data = require(jsonfilename);

async function run() {
  const response = await openai.createEmbedding({ input: searchTerm, model: 'text-embedding-ada-002' });
  const embedding = response.data.data[0].embedding;
  const results = getScores(data, embedding);

  console.log(results.map((a) => `${a.score.toFixed(2)}: ${a.doc.filename}`).join('\n'));
}
run();

function getScores(data, embedding) {
  const results = data
    .map((doc) => ({ score: cosinesim(doc.embedding, embedding), doc }))
    .sort((a, b) => b.score - a.score)
    // Always keep the top 3; keep up to 10 if they score above 0.7
    .filter((result, index) => index < 3 || (index < 10 && result.score > 0.7));

  // Drop duplicate entries for the same file, keeping the highest-scoring one
  const filenames = results.map((result) => result.doc.filename);
  const results_uniq = results.filter((x, index) => filenames.indexOf(x.doc.filename) === index);
  return results_uniq;
}

function cosinesim(A, B) {
  var dotproduct = 0;
  var mA = 0;
  var mB = 0;
  for (let i = 0; i < A.length; i++) {
    dotproduct += A[i] * B[i];
    mA += A[i] * A[i];
    mB += B[i] * B[i];
  }
  mA = Math.sqrt(mA);
  mB = Math.sqrt(mB);
  var similarity = dotproduct / (mA * mB);
  return similarity;
}

You can now run it like this:

node embeddings_search.js I am getting a lot of heartburn.

It will output the filenames of the closest documents along with their similarity scores. You can display the best match directly to the user, or pass it to GPT-3 to reword the output to fit the question.
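Passing the matched document and the user's question to GPT-3 might look like the sketch below. The prompt wording and the helper names `buildPrompt` and `answerQuestion` are assumptions for illustration, not taken from the repository. The sketch also assumes the same `openai` client as in the scripts above and that a GPT-3 completion model such as `text-davinci-003` is available to you.

```javascript
// Build a prompt asking the model to answer using only the matched document.
// The wording is illustrative; adjust it to suit your content.
function buildPrompt(document, question) {
  return [
    'Answer the question using only the information in the document below.',
    'If the document does not contain the answer, say so.',
    '',
    'Document:',
    document.trim(),
    '',
    'Question: ' + question.trim(),
    'Answer:',
  ].join('\n');
}

// Assumes `openai` is an OpenAIApi client configured as in the scripts above.
// The model name is an assumption; substitute whichever completion model you use.
async function answerQuestion(openai, document, question) {
  const response = await openai.createCompletion({
    model: 'text-davinci-003',
    prompt: buildPrompt(document, question),
    max_tokens: 256,
    temperature: 0,
  });
  return response.data.choices[0].text.trim();
}
```

Keeping the prompt construction in its own function makes it easy to inspect exactly what is sent to the model before spending any API calls.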
