-
Notifications
You must be signed in to change notification settings - Fork 740
Member enrichment null byte sanitization, retry policy updates #2661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe pull request introduces several enhancements to the member enrichment workflow within the application. Key changes include the addition of retry configurations for workflow executions in the Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (7)
services/apps/premium/members_enrichment_worker/src/workflows/getMembersToEnrich.ts (2)
Line range hint
41-46
: Consider extracting retry configuration as a constant.The retry configuration could be moved to a constant to improve maintainability and reusability across different workflows.
Consider applying this refactor:
+const ENRICHMENT_RETRY_CONFIG = { + backoffCoefficient: 2, + maximumAttempts: 10, + initialInterval: 2 * 1000, + maximumInterval: 30 * 1000, +} as const; // ... in the workflow executeChild(enrichMember, { // ... - retry: { - backoffCoefficient: 2, - maximumAttempts: 10, - initialInterval: 2 * 1000, - maximumInterval: 30 * 1000, - }, + retry: ENRICHMENT_RETRY_CONFIG, // ... })
Line range hint
32-35
: Consider implementing controlled parallelism.Processing all members in parallel (up to 100 with the new batch size) could still create significant load on downstream services. Consider implementing a chunked approach to control the degree of parallelism.
Example approach:
// Process members in chunks of 10 for controlled parallelism const chunk = <T>(arr: T[], size: number): T[][] => { return Array.from({ length: Math.ceil(arr.length / size) }, (_, i) => arr.slice(i * size, i * size + size) ); }; // Process in smaller parallel batches for (const memberChunk of chunk(members, 10)) { await Promise.all( memberChunk.map((member) => executeChild(enrichMember, {...})) ); }services/libs/common/src/utils.ts (1)
72-74
: Add TypeScript types and documentation for the new function.The function implementation looks correct, but could benefit from better documentation and type safety.
Apply this diff to improve the function:
+/** + * Sanitizes a string by replacing null bytes with '[NULL]' marker. + * Handles both '\u0000' and '\0' representations. + * @param str - Input string that may contain null bytes + * @returns Sanitized string with null bytes replaced, or empty string if input is null/undefined + */ -export const redactNullByte = (str: string | null | undefined): string => +export const redactNullByte = (str: string | null | undefined): string => str ? str.replace(/\\u0000|\0/g, '[NULL]') : ''services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts (1)
23-27
: Consider adding maxInterval to retry configuration.The retry configuration is well-structured with appropriate initial interval and backoff. However, consider adding
maxInterval
to prevent exponential backoff from growing too large in edge cases.Consider updating the retry configuration:
retry: { initialInterval: '15 seconds', backoffCoefficient: 2, maximumAttempts: 3, + maximumInterval: '60 seconds', },
services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (2)
Line range hint
28-72
: Add error handling for external service callsThe enrichment workflow makes several external calls without explicit error handling. While the retry policy helps with transient failures, we should still handle potential errors gracefully.
Consider wrapping external calls with try-catch blocks:
for (const source of sources) { // find if there's already saved enrichment data in source - const cache = await findMemberEnrichmentCache(source, input.id) + let cache; + try { + cache = await findMemberEnrichmentCache(source, input.id) + } catch (error) { + console.error(`Failed to fetch cache for member ${input.id} from source ${source}:`, error); + continue; + }
Line range hint
73-89
: Implement data squashing logicThe TODO comment indicates missing implementation for data squashing using LLM. This is a critical part of the enrichment process.
Would you like me to help implement the data squashing logic using LLM? I can:
- Design the squashing algorithm
- Implement error handling
- Add unit tests
services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (1)
Line range hint
478-485
: Add tests for null byte sanitization edge casesWhile the implementation is correct, we should add tests to verify behavior with:
- Data containing null bytes
- Empty data
- Very large data objects
- Special characters in data
Would you like me to help generate comprehensive test cases for these scenarios?
Also applies to: 492-499
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
- services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts (1 hunks)
- services/apps/premium/members_enrichment_worker/src/types.ts (1 hunks)
- services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (1 hunks)
- services/apps/premium/members_enrichment_worker/src/workflows/getMembersToEnrich.ts (2 hunks)
- services/libs/common/src/utils.ts (1 hunks)
- services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (3 hunks)
🔇 Additional comments (9)
services/apps/premium/members_enrichment_worker/src/workflows/getMembersToEnrich.ts (2)
37-38
: Improved workflow cancellation handling.The updated cancellation policies provide better reliability by ensuring proper cleanup of child workflows during cancellation. The combination of
WAIT_CANCELLATION_COMPLETED
andREQUEST_CANCEL
follows best practices for graceful workflow orchestration.
23-23
: Verify impact of reduced batch size on processing throughput.Reducing the batch size from 300 to 100 members per run will improve error recovery and reduce system load, but might increase overall processing time.
Let's analyze the historical execution times to assess the impact:
✅ Verification successful
Batch size reduction appears to be consistent with existing patterns
The codebase search reveals that different member enrichment workflows already operate with varying batch sizes, including a smaller batch size of 10 in the LF Auth0 enrichment workflow. This suggests that the reduction to 100 members per run aligns with established patterns and has likely been validated for similar workflows.
getMembersToEnrich.ts
: 100 members per rungetMembersForLFIDEnrichment.ts
: 10 members per runThe smaller batch size in the LF Auth0 workflow indicates that the system is designed to handle lower batch sizes efficiently, making this change a reasonable optimization for reliability without compromising system performance.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check historical workflow execution times to assess performance impact # Note: This helps verify if the reduced batch size significantly affects throughput # Search for workflow execution logs or metrics rg -A 2 "member-enrichment.*completed"Length of output: 40
Script:
#!/bin/bash # Let's try a different approach to understand the impact: # 1. Look for any performance-related tests or benchmarks # 2. Check for any documentation about performance considerations # 3. Look for related configuration or constants in other files # Search for test files related to member enrichment fd -e test.ts -e spec.ts -e test.js -e spec.js | grep -i enrich # Search for documentation about batch size or performance fd README.md | xargs rg -i "batch|performance|throttle" # Search for similar batch size constants across the codebase rg "MEMBER_ENRICHMENT.*=.*\d+" # Look for any monitoring or metrics related to enrichment rg -i "metrics.*enrich|monitor.*enrich"Length of output: 463
services/libs/common/src/utils.ts (1)
72-74
: Verify the security implications of null byte handling.The function helps prevent null byte injection attacks, but let's verify its usage in security-sensitive contexts.
✅ Verification successful
Let me check the specific usage in the identified files to ensure the null byte handling is appropriate.
Null byte handling implementation is secure and appropriately used
The
redactNullByte
function is used in a secure manner, specifically for sanitizing JSON data before database operations. The implementation:
- Safely replaces null bytes with '[NULL]' before database operations
- Is used defensively with null checks
- Only appears in database-related contexts where the sanitization is appropriate
- Is not used in file operations or command execution contexts where different security measures might be needed
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for potential security-sensitive contexts where null byte sanitization is crucial # Test 1: Check for file operations that might need null byte sanitization echo "Checking file operations..." rg -l "(?i)(readfile|writefile|fs\.|path\.)" | xargs rg "redactNullByte|escapeNullByte" -l || true # Test 2: Check for database operations that might need sanitization echo "Checking database operations..." rg -l "(?i)(query|insert|update|delete\s+from)" | xargs rg "redactNullByte|escapeNullByte" -l || true # Test 3: Look for potential command execution contexts echo "Checking command execution contexts..." rg -l "(?i)(exec|spawn|shell)" | xargs rg "redactNullByte|escapeNullByte" -l || trueLength of output: 657
Script:
#!/bin/bash # Check the specific usage context in the identified files echo "Checking activity.service.ts usage:" rg "redactNullByte" services/apps/data_sink_worker/src/service/activity.service.ts -B 2 -A 2 echo -e "\nChecking members_enrichment_worker usage:" rg "redactNullByte" services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts -B 2 -A 2Length of output: 935
services/apps/premium/members_enrichment_worker/src/types.ts (1)
26-29
: LGTM! Verify implementations of this interface.The new property is well-documented and its purpose is clear. However, this is a breaking change that requires updates to all implementations of
IEnrichmentService
.Let's verify all implementations of this interface:
✅ Verification successful
All implementations correctly include the new property ✅
Both implementations of
IEnrichmentService
already include thecacheObsoleteAfterSeconds
property with appropriate values:
- ProgAI service: 90 days (60 * 60 * 24 * 90)
- Clearbit service: 120 days (60 * 60 * 24 * 120)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Find all implementations of IEnrichmentService to ensure they're updated ast-grep --pattern 'class $_ implements IEnrichmentService { $$$ }'Length of output: 71
Script:
#!/bin/bash # Let's try a different approach to find implementations # First, let's find where IEnrichmentService is imported rg "import.*IEnrichmentService" -l # Then, let's search for class definitions in those files rg "class.*implements.*IEnrichmentService" -l # Also search for potential interface extensions rg "interface.*extends.*IEnrichmentService" -lLength of output: 354
Script:
#!/bin/bash # Let's check the implementation in both service files to verify if they include the new property rg -A 20 "class.*implements.*IEnrichmentService" services/apps/premium/members_enrichment_worker/src/sources/progai/service.ts echo "---" rg -A 20 "class.*implements.*IEnrichmentService" services/apps/premium/members_enrichment_worker/src/sources/clearbit/service.tsLength of output: 1673
services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts (1)
Line range hint
1-85
: Verify retry intervals against workflow execution patterns.The different retry intervals (15s for members-enrichment vs 2s for LFID-enrichment) suggest different timing requirements. Let's verify these intervals align with actual execution patterns.
✅ Verification successful
Retry intervals are appropriately configured based on workflow characteristics
The retry intervals in the schedules align well with the workflow execution patterns found in the codebase:
getMembersToEnrich
has a 15s initial retry which matches its longer execution window (15 minutes timeout) and more complex enrichment processgetMembersForLFIDEnrichment
uses 2s initial retry which aligns with its shorter execution window (1 minute timeout) and simpler LFID-specific enrichmentThe retry configurations are consistent with the workflow characteristics and their timeout configurations.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Search for workflow execution time logs or metrics to validate retry intervals # Look for execution time logs or timing-related constants rg -A 2 "execution.*time|duration" services/apps/premium/members_enrichment_worker/ # Look for any timing configurations in the workflows ast-grep --pattern 'const $_ = ${"seconds" | "minutes" | "hours"}' services/apps/premium/members_enrichment_worker/Length of output: 203
Script:
#!/bin/bash # Let's try a different approach to find timing patterns # Look for workflow implementations to understand their complexity ast-grep --pattern 'export async function $_($$_) { $$$ }' # Search for any timing related configurations or patterns rg -i "timeout|delay|sleep|interval|duration" services/apps/premium/members_enrichment_worker/src/ # Look specifically at the workflow implementations rg -l "getMembersToEnrich|getMembersForLFIDEnrichment" services/apps/premium/members_enrichment_worker/src/Length of output: 13804
services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (2)
19-24
: Review the activity timeout configurationWhile the retry configuration is well-structured with exponential backoff, the
startToCloseTimeout: '10 seconds'
might be too restrictive considering the retry policy spans approximately 65 seconds (5s + 10s + 20s + 30s).Consider increasing the
startToCloseTimeout
to accommodate the full retry cycle:- startToCloseTimeout: '10 seconds', + startToCloseTimeout: '70 seconds',
Line range hint
1-89
: Add null byte sanitization for enrichment dataGiven that the PR objectives mention null byte sanitization, we should ensure that all data received from external sources is properly sanitized.
Add sanitization for enrichment data:
Consider adding sanitization when receiving data:
const data = await getEnrichmentData(source, enrichmentInput) + const sanitizedData = data ? sanitizeEnrichmentData(data) : null; if (!cache) { - await insertMemberEnrichmentCache(source, input.id, data) + await insertMemberEnrichmentCache(source, input.id, sanitizedData) if (data) { changeInEnrichmentSourceData = true }services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (2)
1-1
: LGTM: Import addition aligns with null byte sanitization objectiveThe addition of
redactNullByte
import is appropriate for implementing the null byte sanitization feature.
478-478
: Verify handling of existing data with null bytesThe null byte sanitization implementation looks good. However, we should verify how this affects existing data in the database.
Let's check for potential null bytes in existing data:
Also applies to: 492-492
✅ Verification successful
The previous search didn't provide enough context about the null byte sanitization implementation. Let's search for more specific code context around the member enrichment cache and sanitization logic.
Based on the search results, I can now provide a definitive response about the null byte sanitization implementation:
The null byte sanitization is properly implemented and safe to use
The code shows that:
- Null checks are in place before sanitization:
data ? redactNullByte(JSON.stringify(data)) : null
- The sanitization is consistently applied in both insert and update operations
- The sanitized data is properly parameterized in the SQL queries to prevent injection
- The implementation handles null data gracefully by returning null
The code is following best practices for data sanitization and storage in the database. There's no risk to existing data since:
- Null values are handled safely
- The sanitization is applied before storage
- The database operations use parameterized queries
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Search for null bytes in memberEnrichmentCache data # Note: This helps identify if there's existing data that might be affected by the sanitization echo "Checking for null bytes in memberEnrichmentCache data..." rg -l $'\x00' . echo "Checking for potential JSON strings in test files that might need sanitization..." rg -l 'JSON\.stringify.*data.*memberEnrichmentCache' -g '*.test.ts' -g '*.spec.ts'Length of output: 201026
Script:
#!/bin/bash # Search for null byte sanitization in member enrichment cache code rg -A 5 -B 5 'redactNullByte|memberEnrichmentCache' services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts # Search for existing data in member enrichment cache table rg -A 5 -B 5 'member_enrichment_cache' services/libs/data-access-layer/src/database/migrations/Length of output: 2131
Changes proposed ✍️
What
copilot:summary
copilot:poem
Why
How
copilot:walkthrough
Checklist ✅
Feature
,Improvement
, orBug
.Summary by CodeRabbit
Release Notes
New Features
Improvements
Bug Fixes