Frequently Asked Questions

BioForging is designed for a wide spectrum of users in the biosciences field: from researchers and scientists in laboratories who need powerful genomic analysis tools, to undergraduate and graduate students who are learning bioinformatics or other biosciences.

The Free version of BioForging offers a solid core of tools, ideal for your analyses and research projects. For researchers looking to go further, BioForging Premium unlocks our complete arsenal, giving you access to advanced analysis modules, specialized algorithms, and extended visualization capabilities.

BioForging is cross-platform. It runs natively on Windows, macOS, and Linux so you can research regardless of the operating system you use.

Our platform allows you to perform genomic and proteomic analysis, 3D molecular visualization, primer design and more, all from an intuitive interface.

We use state-of-the-art encryption and comply with international security standards to ensure that your information and your lab's data are always protected.

Yes, the Electronic Laboratory Notebook (ELN) is designed for teamwork, allowing you to assign tasks, share protocols, and visualize the progress of every experiment centrally.

Whether you are studying Kinases, Proteases, G Protein-Coupled Receptors (GPCRs), or Ion Channels, the accuracy of BioForging's Artificial Intelligence depends on the same thing: a mathematical model is only as good as the quality of the data it consumes.

This guide will explain how to build, curate, and assemble a perfect CSV file to train high-precision sessions for any therapeutic target.

Step 1: Obtain the Actives

The core of your dataset should be compounds scientifically known to modulate your target protein (inhibitors, agonists, antagonists, etc.).

The Ideal Source: Use high-quality public databases like ChEMBL, PubChem, or BindingDB. Download the data by searching for the official name of your protein.
Filter by Affinity: Keep only compounds with high-potency measurements (e.g., IC50 < 100 nM, Kd < 100 nM, or Ki < 100 nM). Do not clutter your dataset with "mediocre" compounds that barely interact with the protein.
Structural Cleaning: Remove any row that lacks a valid SMILES string, as well as rows for extremely rare laboratory molecules containing heavy metals.
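The filtering and cleaning rules above can be sketched in a few lines of Python. This is only an illustration, not BioForging's own import logic: the column names (smiles, standard_type, standard_value, standard_units) and the short list of metal symbols are assumptions, and the SMILES check is deliberately crude (non-empty, no spaces). A real pipeline would parse each SMILES with a cheminformatics toolkit.

```python
import csv
import io

POTENCY_CUTOFF_NM = 100.0             # threshold from the guide (IC50, Kd, or Ki)
POTENCY_TYPES = {"IC50", "Kd", "Ki"}
# Illustrative, non-exhaustive list of metal symbols to screen out (assumption).
HEAVY_METALS = ("Hg", "Pb", "Cd", "As", "Pt", "Au", "Sn")


def has_heavy_metal(smiles):
    # Metals appear in SMILES as bracket atoms, e.g. [Pt] or [Hg+2].
    # Isotope-labelled atoms such as [195Pt] are not caught by this crude test.
    return any("[" + metal in smiles for metal in HEAVY_METALS)


def keep_active(row):
    """Return True if the row is a clean, high-potency measurement."""
    smiles = (row.get("smiles") or "").strip()
    if not smiles or " " in smiles or has_heavy_metal(smiles):
        return False
    if row.get("standard_type") not in POTENCY_TYPES:
        return False
    if row.get("standard_units") != "nM":
        return False
    try:
        value = float(row["standard_value"])
    except (KeyError, TypeError, ValueError):
        return False
    return value < POTENCY_CUTOFF_NM


def filter_actives(csv_text):
    """Parse CSV text and return only the rows that pass keep_active."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if keep_active(row)]
```

Rows whose potency is reported in other units are dropped rather than converted, which keeps the sketch short at the cost of discarding some usable data.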

Step 2: Force Structural Diversity

If 2500 of your 3000 active compounds are near-identical derivatives of a single famous drug (differing by only one carbon atom), the AI will become "lazy". It will memorize that one skeleton and reject everything else.

What to do? Ensure your database has representatives from different chemical classes.
Example: If you study a kinase, make sure to have compounds that bind to the ATP site of the active conformation (Type I), but also include Type II inhibitors, which bind the inactive (DFG-out) conformation, and allosteric inhibitors (Type III or IV), which have radically different structures. This forces the AI to learn shared abstract chemical patterns (Scaffold-Hopping) instead of memorizing the shape of a single pill.
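One way to check for the imbalance described in this step is to count how many actives share each scaffold. The sketch below assumes every compound has already been assigned a scaffold identifier by some external tool; the function name and the 50% threshold are hypothetical choices for illustration, not values prescribed by BioForging.

```python
from collections import Counter


def dominant_scaffold(scaffolds, max_share=0.5):
    """Return (scaffold, share) if one scaffold exceeds max_share, else None.

    `scaffolds` holds one scaffold identifier per active compound.
    `max_share` is an arbitrary illustrative threshold.
    """
    if not scaffolds:
        return None
    scaffold, count = Counter(scaffolds).most_common(1)[0]
    share = count / len(scaffolds)
    return (scaffold, share) if share > max_share else None
```

With the numbers from the text (2500 of 3000 actives on one skeleton), the function reports that scaffold with a share of about 0.83, a signal to add compounds from other chemical classes.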

Step 3: Inject Real Negatives and "Traps" (Hard Decoys)

If you only give the AI compounds that work, the neural network will naively assume that any molecule in the universe sharing some of those pieces (like a simple benzene ring) also works for your protein.

The BSI engine is mathematically designed to exploit structural differences. Therefore, you must manually add negative controls labeled 0 (Inactive) to the end of your CSV, divided into two essential categories:

Known Experimental Negatives: Search the same databases for compounds that were tested against your protein but showed no activity (e.g., compounds with IC50 > 10,000 nM or measurements reported as "Inactive").
Why are they invaluable? Many of these negatives are direct structural analogs of your actives (differing only by a methyl group, the position of a nitrogen, etc.). By including an active analog (label 1) and its inactive analog (label 0), the AI learns the precise topology of the binding pocket and discovers which critical atomic changes destroy affinity.
General Pharmacology Traps (Hard Decoys): These are completely unrelated inactive compounds that serve to prevent the neural network from overestimating famous fragments. Use the following logic to build your universal traps:
- Halogen Traps (Fluorine/Chlorine): Many modern drugs use fluorine. Add common fluorinated inactives (e.g., Fluoxetine/Prozac).
- Nitrogen Density Traps: Add compounds with nitrogen rings (e.g., Methotrexate, Caffeine, Sildenafil/Viagra). This prevents the AI from blindly associating nitrogen with affinity.
- Simplicity Traps: Add simple scaffolds (e.g., Aspirin, Ibuprofen).
- Steric Traps (Extreme Size): Add large, highly lipophilic molecules (e.g., Cholesterol) and giant cyclic molecules (e.g., Erythromycin). This teaches the AI to respect the size limits of the protein pocket.
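As a rough illustration of the trap categories above, the following sketch appends a few well-known decoys, labeled 0, to a list of rows. The row layout (chembl_id, smiles, label) follows the column names used in this guide; the decoy IDs and the function name are made up for the example, and the list covers only some of the trap types.

```python
# Small illustrative decoy set; IDs are hypothetical labels, not database IDs.
HARD_DECOYS = [
    ("Decoy_fluoxetine", "CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F"),  # halogen trap
    ("Decoy_caffeine", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"),  # nitrogen density trap
    ("Decoy_aspirin", "CC(=O)OC1=CC=CC=C1C(=O)O"),       # simplicity trap
    ("Decoy_ibuprofen", "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"),  # simplicity trap
]


def append_decoys(rows, decoys=HARD_DECOYS):
    """Return a new list: the original rows followed by decoys labeled 0.

    A decoy is skipped if its exact SMILES string already appears in `rows`,
    so a compound listed as active is not also added as inactive.
    """
    existing = {row["smiles"] for row in rows}
    result = list(rows)
    for decoy_id, smiles in decoys:
        if smiles not in existing:
            result.append({"chembl_id": decoy_id, "smiles": smiles, "label": 0})
    return result
```

The duplicate check compares raw strings only. Because one molecule can be written as many different SMILES strings, a careful workflow would canonicalize the structures first.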

Step 4: Format the CSV File

Your final CSV file must be perfectly structured for BioForging to process it without errors.

The minimum and essential columns the CSV must have are:

chembl_id (or any textual ID to identify the row, like Mol_001).
smiles (The exact chemical structure of the molecule).
label (This is the key to all training: 1 for your downloaded actives, and 0 for the experimental negatives and general pharmacology traps you added).

(Additional columns for IC50, units, or names are useful for the researcher, but BioForging's BSI neural network only needs to look at the SMILES and the Label).
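Before uploading, the file can be checked against the three required columns with a short script. This is a minimal sketch using only the Python standard library; the function name and the wording of the messages are illustrative, and it checks structure only, not whether each SMILES is chemically valid.

```python
import csv
import io

REQUIRED_COLUMNS = ("chembl_id", "smiles", "label")


def validate_training_csv(csv_text):
    """Return a list of problems found; an empty list means the file passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]
    problems = []
    # Line 1 is the header, so data rows start at line 2
    # (assuming no field spans several lines).
    for line_no, row in enumerate(reader, start=2):
        if not (row["chembl_id"] or "").strip():
            problems.append(f"line {line_no}: empty chembl_id")
        if not (row["smiles"] or "").strip():
            problems.append(f"line {line_no}: empty smiles")
        if (row["label"] or "").strip() not in {"0", "1"}:
            problems.append(f"line {line_no}: label must be 0 or 1")
    return problems
```

Extra columns such as IC50 values or compound names are ignored by the check, in line with the note above that only the SMILES and the label matter for training.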

This guide is for when you've trained the BSI model using a dataset focused on a single protein (like HIV Integrase) or a closely related protein family.

Unlike traditional methods that only look at physical similarities between molecules, the Bioactivity Similarity Index (BSI) searches for a drug's "biological profile." Because of this, interpreting the results requires a different approach.

1. Making sense of "Green" matches (Bioactivity Certainty)

In the BSI system, a higher percentage of "greens" means the neural network is more confident that your molecule is a real inhibitor. The model is essentially recognizing the active profile of your molecule across all those reference compounds.

Here is a practical breakdown of those percentages:

Under 3% (Statistical Noise / Inactive): This is the critical cutoff. If your candidate triggers only 30 to 80 greens out of a 3000-compound database (roughly 1% to 2.7%), it should be considered inactive. "Decoy" molecules (like aspirin) typically trigger 1% or 2% simply due to minor mathematical coincidences. If it doesn't break the 3% barrier, the compound has likely failed.
Between 5% and 15% (The Novelty Range): The model firmly believes the compound is active, but its biological profile only matches a very specific subset of known inhibitors. This is usually an excellent candidate if you are looking for entirely new chemical scaffolds.
Between 15% and 50% (The Ideal Range): This is the sweet spot. Your molecule lit up a significant portion of all known drugs for that protein. The AI is extremely confident in its potential.
Over 90% (Promiscuous Range): Be careful here. If your compound lights up almost everything in the table, you're likely looking at a pan-assay interference compound (PAINS). It's a "sticky" molecule that will react with almost anything and is likely to be toxic.
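The percentage bands above can be turned into a small helper for screening many candidates at once. This is only a sketch of the ranges as written in this guide: the function name is hypothetical, the handling of exact boundary values (for instance, 15% counted as ideal) is an assumption, and percentages that fall between the listed bands are reported as not covered rather than guessed.

```python
def interpret_green_percentage(green_count, reference_count):
    """Map a green-match count to the bands described in the guide.

    Returns (percentage, band_name).
    """
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    pct = 100.0 * green_count / reference_count
    if pct < 3:
        band = "noise / inactive"
    elif 5 <= pct < 15:
        band = "novelty range"
    elif 15 <= pct <= 50:
        band = "ideal range"
    elif pct > 90:
        band = "promiscuous range"
    else:
        # 3-5% and 50-90% are not described in the guide.
        band = "not covered by the guide"
    return pct, band
```

For example, 80 greens out of 3000 reference compounds is about 2.7% and lands in the noise band, while 1050 out of 3000 is 35% and lands in the ideal range.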

Quality Control Tip: Since the inactive compounds you used for training (label=0) are hidden from your final results table, it's good practice to run a reverse test to rule out false positives. Manually enter the SMILES of any decoy molecule. If your model is robust, that decoy should fall squarely into the noise range (under 3%).

2. Spotting the perfect discovery: BSI vs. Tanimoto

Imagine you have two candidates, A and B, both scoring an excellent 35% in BSI. Which one should you actually synthesize and test? The tiebreaker comes from comparing the BSI score against the Tanimoto structural similarity score.

The winning combo (High BSI + Low Tanimoto): This is where true discovery happens (Scaffold-Hopping). If your compound breaks 15% BSI but its physical similarity (Tanimoto) to known drugs is tiny (e.g., 0.08 to 0.15), you've found a completely novel chemical structure that, according to the model, should do the same biological job as the market's most potent drugs. This is a highly patentable candidate.
The safe but predictable candidate (High BSI + High Tanimoto): If Candidate B has a high BSI but also a very high Tanimoto (e.g., 0.85), you're looking at a clone or a close derivative of a drug already in your database. It's still a good inhibitor, but structurally it doesn't bring anything new to the table and is likely already patented.
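The tiebreaker can be illustrated with the standard Tanimoto coefficient, which for two fingerprints A and B is the number of shared on-bits divided by the number of bits set in either. The sketch below takes fingerprints as plain Python sets of bit indices (how they are generated is outside its scope). The triage function and its cutoffs are illustrative, using the example values from the text (15% BSI, Tanimoto of 0.15 and 0.85).

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def triage(bsi_pct, max_tanimoto, bsi_cutoff=15.0, low_sim=0.15, high_sim=0.85):
    """Classify a candidate from its BSI percentage and its highest
    Tanimoto similarity to any known drug in the reference set."""
    if bsi_pct < bsi_cutoff:
        return "below the BSI cutoff"
    if max_tanimoto <= low_sim:
        return "scaffold hop (high BSI, low Tanimoto)"
    if max_tanimoto >= high_sim:
        return "close derivative (high BSI, high Tanimoto)"
    return "intermediate similarity"
```

Two candidates that both score 35% in BSI would be separated this way: one with a maximum Tanimoto of 0.10 is flagged as a scaffold hop, one with 0.85 as a close derivative.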

This guide is designed for when you train the BSI model using a massive dataset covering several proteins or entire families (e.g., a database with 10,000 active compounds against Kinases A, B, C, and D).

When working with multiple targets, the BSI network acts as a strict evaluator of affinity and toxicity. Your primary goal shifts here: you're no longer just trying to hit your target protein, but also ensuring the compound completely ignores everything else.

1. Interpreting Bioactive Selectivity (The "Green" Rule)

Unlike single-target models where you want as many greens as possible, in a massive (Pan-Target) environment, the percentage of greens tells you how selective or promiscuous your compound is.

Assuming a dataset with thousands of drugs distributed across many proteins, here is how to interpret the percentages:

Extreme Selectivity (1% to 5% of the total dataset): This is the perfect scenario. Your candidate lit up only the inhibitors for your protein of interest, leaving everything else at a strict 0%. The model is indicating that the compound's biological profile is specific to that target alone.
Dual or Triple-Action Profile (5% to 10% of the total dataset): Your compound highlights drugs for Protein A in green, but also those for Protein B with high certainty (BSI > 0.8). For certain complex diseases, like some types of cancer, inhibiting two pathways at once is ideal. However, for other conditions, this guarantees side effects.
Non-specific or Promiscuous Compound (Over 15% of the total dataset): If the molecule lights up against Kinase inhibitors, serotonin receptors, and ion channels simultaneously, be careful. The network is warning you that it's a highly non-specific pan-assay interference compound (PAINS). Essentially, it will stick to anything in the body and will likely be toxic. It's best to discard it.

About inactive controls: Just like in single-target assays, inactive molecules (decoys) won't appear in the final results. To validate your model, manually search the SMILES of compounds like aspirin or sildenafil; these should show zero green results against all proteins in your database.

2. Predictive Side-Effect Mapping (Off-Target)

One of the biggest advantages of a BSI model trained with multiple families is that it functions as a predictive toxicity panel.

If you evaluate your best candidate and get 200 green matches (BSI > 0.8), your next step is to sort the results by the "Target Protein" column and review exactly what lit up:

Target Confirmation (On-Target): If all 200 molecules strictly match inhibitors for the protein you intended to target, you are well on your way to a very safe drug.
Cross-Toxicity Detection (Off-Target): If your goal was to develop an anti-inflammatory, but you notice that 15 of the green compounds are known to block the heart's hERG channel or affect psychiatric receptors, the model just saved you years of testing. It's predicting that the compound could cause arrhythmias or severe adverse neurological effects.
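Sorting by target and reviewing what lit up amounts to a group-and-count operation. The sketch below assumes the results table has been exported as a list of records with hypothetical field names target_protein and bsi, and applies the BSI > 0.8 green cutoff mentioned above.

```python
from collections import Counter

GREEN_THRESHOLD = 0.8  # green cutoff mentioned in the guide (BSI > 0.8)


def greens_by_target(results, threshold=GREEN_THRESHOLD):
    """Count green matches per target protein."""
    return Counter(r["target_protein"] for r in results if r["bsi"] > threshold)


def off_target_hits(results, intended_target, threshold=GREEN_THRESHOLD):
    """Return green counts for every target other than the intended one."""
    counts = greens_by_target(results, threshold)
    return {t: n for t, n in counts.items() if t != intended_target}
```

An empty dictionary from off_target_hits corresponds to the on-target confirmation case. Any entry, such as a handful of hERG matches, is a prompt to review those compounds before going further.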

3. Final Selection Criteria in Multi-Target Environments

When you have two excellent candidates (A and B) for your target protein, you must make a strategic decision based on safety and novelty.

The Cleanliness Factor (Selectivity): Suppose Candidate A has 100 greens against your protein, but 10 greens against toxic targets (like hERG). Meanwhile, Candidate B has only 50 greens for your target, but zero greens for everything else. In real-world pharmaceutical development, Candidate B is the clear winner. The total absence of predicted toxicity is almost always more valuable than a slight increase in potency.
Structural Novelty (Scaffold-Hopping): Some chemical backbones (like quinolines) are famous for interacting with multiple proteins at once. If your Candidate A uses one of these common backbones and Candidate B has an entirely new chemical structure (Tanimoto < 0.15) that the BSI model has never seen before, yet it still manages to exclusively target your protein, Candidate B is the one you should patent. You've just discovered a highly selective new chemical key.
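The two criteria above can be expressed as a simple sort order. This is one possible reading, not a rule built into BioForging: fewer off-target greens come first, then lower Tanimoto similarity (more novel), and only then more on-target greens. The field names and the order of the tiebreakers are assumptions made for the example.

```python
def rank_candidates(candidates):
    """Order candidates by selectivity, then novelty, then on-target support.

    Each candidate is a dict with hypothetical keys:
    name, on_target_greens, off_target_greens, max_tanimoto.
    """
    return sorted(
        candidates,
        key=lambda c: (
            c["off_target_greens"],   # cleanliness first
            c["max_tanimoto"],        # then structural novelty
            -c["on_target_greens"],   # then strength of on-target signal
        ),
    )
```

With the example from the text, Candidate B (50 on-target greens, 0 off-target) ranks ahead of Candidate A (100 on-target greens, 10 off-target). A strict sort is a simplification of "almost always more valuable", so borderline cases still call for expert judgment.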
