Code Duplication Detection

Primus Security v2.2.0 adds token-based code duplication detection (Phase 7) — the same algorithmic class as SonarQube's Copy-Paste Detector (CPD).

How it works

  1. Tokenise every .cs file, discarding whitespace and comments
  2. Normalise tokens — identifiers → $ID, string literals → $STR, numbers → $NUM (so renamed variables don't break matching)
  3. Sliding window — compute a Rabin-Karp rolling hash over windows of MinBlockTokens tokens
  4. Index fingerprints → locations; a fingerprint recorded at two or more locations marks a candidate duplicate
  5. Merge adjacent/overlapping blocks from the same file pair
  6. Report duplicate blocks with file, line range, and duplication percentage
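
The first four steps can be sketched in a few dozen lines of C#. This is only an illustration of the general technique, not the Primus Security implementation: the class `DuplicationSketch`, its methods, the regex-based lexer, the shortened keyword list, and the hash constants are all assumptions made for the example. Merging (step 5) and reporting (step 6) are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative sketch only; not Primus Security's actual implementation.
public static class DuplicationSketch
{
    // Hypothetical, heavily simplified lexer: comments/whitespace, strings,
    // numbers, words, then any other single character as punctuation.
    private static readonly Regex Lexer = new Regex(
        @"(?<skip>//[^\n]*|/\*.*?\*/|\s+)|(?<str>""(?:\\.|[^""\\])*"")" +
        @"|(?<num>\d[\w.]*)|(?<word>[A-Za-z_]\w*)|(?<punct>.)",
        RegexOptions.Singleline);

    // Tiny keyword sample; a real lexer would know the full C# keyword set.
    private static readonly HashSet<string> Keywords = new HashSet<string>
    {
        "class", "else", "for", "foreach", "if", "int", "new", "public",
        "return", "static", "string", "var", "void", "while"
    };

    // Steps 1-2: tokenise, drop whitespace/comments, normalise to $ID/$STR/$NUM.
    public static List<(string Text, int Line)> Tokenise(string source)
    {
        var tokens = new List<(string Text, int Line)>();
        int line = 1;
        foreach (Match m in Lexer.Matches(source))
        {
            if (!m.Groups["skip"].Success)
            {
                string text =
                    m.Groups["str"].Success ? "$STR" :
                    m.Groups["num"].Success ? "$NUM" :
                    m.Groups["word"].Success && !Keywords.Contains(m.Value) ? "$ID" :
                    m.Value;
                tokens.Add((text, line));
            }
            line += m.Value.Count(c => c == '\n'); // track the current line number
        }
        return tokens;
    }

    // Deterministic per-token code (FNV-1a over UTF-16 code units).
    private static ulong TokenCode(string token)
    {
        ulong h = 14695981039346656037UL;
        foreach (char c in token) h = unchecked((h ^ c) * 1099511628211UL);
        return h;
    }

    // Steps 3-4: Rabin-Karp rolling hash per window, indexed to locations.
    // Equal hashes only suggest a match; a robust tool would also compare tokens.
    public static Dictionary<ulong, List<(string File, int StartLine, int EndLine)>> IndexWindows(
        IEnumerable<(string File, string Source)> files, int minBlockTokens)
    {
        if (minBlockTokens < 1) throw new ArgumentOutOfRangeException(nameof(minBlockTokens));

        const ulong Base = 1_000_003;
        ulong high = 1; // Base^(minBlockTokens - 1), wrapping modulo 2^64
        for (int i = 1; i < minBlockTokens; i++) high = unchecked(high * Base);

        var index = new Dictionary<ulong, List<(string File, int StartLine, int EndLine)>>();
        foreach (var (file, source) in files)
        {
            var tokens = Tokenise(source);
            ulong hash = 0;
            for (int i = 0; i < tokens.Count; i++)
            {
                unchecked
                {
                    // Slide: drop the token leaving the window, append the new one.
                    if (i >= minBlockTokens)
                        hash -= TokenCode(tokens[i - minBlockTokens].Text) * high;
                    hash = hash * Base + TokenCode(tokens[i].Text);
                }
                if (i < minBlockTokens - 1) continue; // window not yet full

                if (!index.TryGetValue(hash, out var locations))
                    index[hash] = locations = new List<(string File, int StartLine, int EndLine)>();
                locations.Add((file, tokens[i - minBlockTokens + 1].Line, tokens[i].Line));
            }
        }
        return index;
    }
}
```

Any entry in the returned dictionary that holds more than one location is a candidate duplicate. Because identifiers, strings, and numbers are normalised before hashing, `int total = a + b;` and `int sum = x + y;` produce the same fingerprint in this sketch.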

Enabling

Duplication detection is opt-in (adds scan time proportional to codebase size):

// appsettings.json
{
  "PrimusSecurity": {
    "EnableDuplicationDetection": true,
    "DuplicationMinBlockTokens": 100,
    "QualityGate": {
      "MaxDuplicateBlocks": 10
    }
  }
}

Or via the CLI:

primus-scan ./MyApp --duplication --max-duplication 10

Reading results

var result = await scanner.ScanAsync("./MyApp");
var dup = result.DuplicationReport;

// DuplicationReport is null when duplication detection is disabled
if (dup != null)
{
    Console.WriteLine($"Duplicate blocks: {dup.DuplicateBlocks.Count}");
    Console.WriteLine($"Duplicated tokens: {dup.DuplicatedPercent:F1}%");

    foreach (var block in dup.DuplicateBlocks)
    {
        Console.WriteLine("Duplicate block:");
        foreach (var loc in block.Locations)
        {
            Console.WriteLine($" {loc.FilePath}:{loc.StartLine}-{loc.EndLine}");
        }
    }
}

Configuration reference

| Option | Default | Description |
| --- | --- | --- |
| EnableDuplicationDetection | false | Opt-in; disabled by default |
| DuplicationMinBlockTokens | 100 | ~10 lines of code. Increase to reduce noise |
| DuplicationMaxBlocks | -1 | Quality gate threshold. -1 = disabled |
| QualityGate.MaxDuplicateBlocks | -1 | Equivalent gate field in QualityGate object |
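
To make the threshold semantics concrete, here is a minimal sketch of how such a gate could be evaluated. The helper `DuplicationGateSketch.Passes` is hypothetical and not part of the Primus Security API. It assumes the gate fails only when the number of duplicate blocks exceeds the threshold, and that any negative threshold (the default is -1) disables the check; the exact comparison the tool uses is not stated here.

```csharp
using System;

// Hypothetical helper; not part of the Primus Security API.
public static class DuplicationGateSketch
{
    // Assumption: the gate fails only when the block count exceeds the threshold,
    // and any negative threshold (the documented default is -1) disables it.
    public static bool Passes(int duplicateBlockCount, int maxDuplicateBlocks)
    {
        if (duplicateBlockCount < 0)
            throw new ArgumentOutOfRangeException(nameof(duplicateBlockCount));

        return maxDuplicateBlocks < 0 || duplicateBlockCount <= maxDuplicateBlocks;
    }
}
```

Under these assumptions, a scan that finds 12 duplicate blocks fails a gate set to 10 and passes when the gate is left at -1.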

SARIF output

When duplication is enabled, the SARIF run.properties section includes:

{
  "duplicateBlocks": 3,
  "duplicationPercent": 4.2
}
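
A CI step could read these two values back out of the SARIF log with System.Text.Json. The sketch below is an assumption-laden example: the class `SarifDuplicationReader` is hypothetical, and it assumes the values sit under `properties` of the first entry in the SARIF `runs` array. It returns null when they are absent, for example when duplication detection was not enabled.

```csharp
using System;
using System.Text.Json;

// Hypothetical reader; assumes the values live in runs[0].properties.
public static class SarifDuplicationReader
{
    public static (int Blocks, double Percent)? Read(string sarifJson)
    {
        using JsonDocument doc = JsonDocument.Parse(sarifJson);
        JsonElement root = doc.RootElement;

        // Walk root -> runs[0] -> properties, bailing out if any level is missing.
        if (root.ValueKind != JsonValueKind.Object
            || !root.TryGetProperty("runs", out JsonElement runs)
            || runs.ValueKind != JsonValueKind.Array
            || runs.GetArrayLength() == 0)
            return null;

        JsonElement run = runs[0];
        if (run.ValueKind != JsonValueKind.Object
            || !run.TryGetProperty("properties", out JsonElement props)
            || props.ValueKind != JsonValueKind.Object)
            return null;

        if (!props.TryGetProperty("duplicateBlocks", out JsonElement blocks)
            || !props.TryGetProperty("duplicationPercent", out JsonElement percent))
            return null;

        // Values are copied out before the JsonDocument is disposed.
        return (blocks.GetInt32(), percent.GetDouble());
    }
}
```

Calling `SarifDuplicationReader.Read(File.ReadAllText(path))` with the path to a SARIF log would then give the block count and percentage for the first run, or null if the properties are missing.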

Performance notes

  • Runs after SAST analysis in ScanAsync()
  • Scales linearly with total token count
  • A 50,000-line codebase typically completes in under 3 seconds
  • Set DuplicationMinBlockTokens = 200 for large repos to reduce noise