Glean Data Handling Skill

Glean Data Handling

Overview

Glean enterprise search ingests documents from dozens of connectors (Google Drive, Confluence, Slack, Jira, Salesforce, etc.) and builds a unified search index with permission-aware access control. Data types include indexed document content, connector metadata, user permission maps, query logs, and search analytics. All document content must be PII-filtered before indexing, permission boundaries must be preserved to prevent data leakage across teams, and retention policies must be enforced to comply with corporate governance and GDPR/CCPA obligations.

Data Classification

| Data Type | Sensitivity | Retention | Encryption | |-----------|-------------|-----------|------------| | Indexed document content | High (may contain PII) | Per source retention policy | AES-256 at rest | | User permission maps | High (access control) | Sync lifecycle | TLS + at rest | | Connector metadata | Medium | Until connector removed | AES-256 at rest | | Search query logs | Medium (reveals intent) | 90 days default | AES-256 at rest | | Search analytics/aggregates | Low | 1 year | TLS in transit |

Data Import

interface GleanDocument {
  id: string; datasource: string; title: string;
  body: string; permissions: { allowedUsers?: string[]; allowAnonymousAccess?: boolean };
  updatedAt: string; url: string;
}

async function indexDocuments(docs: GleanDocument[], datasource: string) {
  // PII strip before indexing
  const sanitized = docs.map(doc => ({
    ...doc,
    body: stripPII(doc.body),
  }));
  // Batch upload with pagination (max 100 per request)
  for (let i = 0; i < sanitized.length; i += 100) {
    const batch = sanitized.slice(i, i + 100);
    await fetch(`https://customer-be.glean.com/api/index/v1/bulkindexdocuments`, {
      method: 'POST',
      headers: { Authorization: `Bearer ${process.env.GLEAN_INDEXING_TOKEN}`, 'Content-Type': 'application/json' },
      body: JSON.stringify({ datasource, documents: batch }),
    });
  }
}

function stripPII(text: string): string {
  return text
    .replace(/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, '[EMAIL_REDACTED]')
    .replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, '[PHONE_REDACTED]')
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, '[SSN_REDACTED]');
}

Data Export

async function exportSearchAnalytics(startDate: string, endDate: string) {
  const res = await fetch(`https://customer-be.glean.com/api/v1/analytics`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.GLEAN_API_TOKEN}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ startDate, endDate, metrics: ['query_count', 'click_through', 'zero_results'] }),
  });
  const data = await res.json();
  // Redact user identifiers from analytics export
  return data.results.map((r: any) => ({ ...r, userId: undefined, query: r.query?.length > 3 ? r.query : '[SHORT_QUERY_REDACTED]' }));
}

Data Validation

function validateDocument(doc: GleanDocument): string[] {
  const errors: string[] = [];
  if (!doc.id || doc.id.length > 512) errors.push('Invalid document ID');
  if (!doc.datasource) errors.push('Missing datasource identifier');
  if (!doc.title || doc.title.length > 1000) errors.push('Title missing or exceeds 1000 chars');
  if (!doc.body || doc.body.length === 0) errors.push('Empty document body');
  if (!doc.permissions) errors.push('Missing permissions — defaults to deny-all');
  if (doc.updatedAt && isNaN(Date.parse(doc.updatedAt))) errors.push('Invalid updatedAt timestamp');
  return errors;
}

Compliance

[ ] PII stripped from document body before indexing (emails, phones, SSNs)
[ ] Permission boundaries enforced: allowedUsers scope matches source system ACLs
[ ] Connector credentials stored in secret manager, rotated quarterly
[ ] Search query logs retained max 90 days, purged via automated job
[ ] GDPR right-to-erasure: delete all indexed content referencing a specific user on request
[ ] CCPA: honor do-not-sell signals for search analytics data
[ ] SOC 2 Type II audit trail for all indexing and deletion operations

Error Handling

| Issue | Cause | Fix | |-------|-------|-----| | 403 on bulk index | Expired or insufficient indexing token | Rotate token, verify datasource permissions | | Permission mismatch in search | Stale ACL sync from connector | Force re-sync connector permissions via admin API | | PII detected in indexed content | New PII pattern not in strip regex | Add pattern to stripPII, re-index affected datasource | | Zero-result queries spike | Connector sync failure, stale index | Check connector health dashboard, trigger manual re-crawl | | Rate limit 429 on indexing | Batch size too large or too frequent | Reduce batch to 50 docs, add 500ms delay between batches |

Resources

Next Steps

See glean-security-basics.

Agent Skills: Glean Data Handling

Install this agent skill to your local

Skill Files