Prerequisites for Better Troubleshooting:

  • Journey ID
  • Account Name

Initial Verification Steps to be Followed by Business Teams (CRS/CSM/CSS/Onboarding/CAT):

  1. Verify User Presence and Reachability:
    • Email: Check if there are users in the list/segment and verify if the user is blacklisted or reachable by email.
    • SMS: Check if there are users in the list/segment and verify if the user is blacklisted or reachable by SMS.
    • App: Check if there are users in the list/segment and verify if the user is reachable via the app.
    • BPN: Check if there are users in the list/segment and verify if the user is blacklisted or reachable by web push.


Troubleshooting Steps to be Followed by CAT:

  1. Step 1: Re-verify CSM Checks:

    • Assume no checks were done by the Business Teams and repeat every initial verification in more depth, especially the reachability checks, tokens, etc.
       
  2. Step 2: Check Audit Logs for Journey Deployment:

    • In case a journey is deployed multiple times and communication is not triggered upon re-deployment, search for the journey's audit logs in the path:

      /net/tmp/audit_log/<panelname>/audit_log.log
    • Audit Log Indicators:
      • [status] => 1 shows the journey was deployed.
      • [redeploy_auto_condition] => 1 indicates that the user has chosen to skip existing users.
      • [redeploy_auto_condition] => 0 indicates that the user has chosen to send to all users.
    • Screenshot:



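    The audit log indicators listed in Step 2 can be read off with a small helper. This is a minimal sketch, not an existing tool: the function name explain_audit_indicators is hypothetical, and it assumes each indicator appears on its own line exactly as written above (for example "[status] => 1").

```shell
#!/bin/sh
# Hypothetical helper: translate audit-log indicator lines into plain English.
# Reads log text on stdin. Assumes one indicator per line, as in "[status] => 1".
explain_audit_indicators() {
  while IFS= read -r line; do
    # Strip trailing whitespace so the value must be the last thing on the line
    # (this keeps "[status] => 1" from matching "[status] => 10").
    line="${line%"${line##*[![:space:]]}"}"
    case "$line" in
      *"[status] => 1")                  echo "journey deployed" ;;
      *"[redeploy_auto_condition] => 1") echo "redeploy: skip existing users" ;;
      *"[redeploy_auto_condition] => 0") echo "redeploy: send to all users" ;;
    esac
  done
}
```

    Usage: pipe the audit log entries for the journey under investigation into the function, for example explain_audit_indicators < /net/tmp/audit_log/<panelname>/audit_log.log. Narrow the input to the relevant journey first; how a journey is keyed inside the log is not covered by this playbook.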
  3. Step 3: Analyze Papi UI Logs:

    • Papi UI logs will show if the CRON ran and which segment/list was selected.
    • Command to analyze Papi UI logs:

      cd /opt/tmp/papiui_log/
      grep "Dataset Cron Started Client id=><clientid>  Automation id=><autoid>" log-<date>*
    • Backup Server: backup6 server, Path -> /backup5/papi1/papiui_log/papiui_log

  4. Step 4: Check GoGetDataset Logs:

    • GoGetDataset logs will show the exact count of users that were triggered in a segment/list.
    • Command to check GoGetDataset logs:

      cd /var/log/apps/GoGetDataset
      zgrep "Cid : <clientid> - Autoid : <autoid>" goserverapi.log*
    • Output would be as below:

      goserverapi.log:2022-10-10 07:00:01.000 INFO ReqID : 6b1b4e34-5a73-4b7d-8052-8a3dc7243b51 Total User Count is : 11 for Cid : 81987 - Autoid : 70


    • grep for the ReqID from the output above ('6b1b4e34-5a73-4b7d-8052-8a3dc7243b51' in this example) to get the complete logs for that run.
    • Backup Server: backup8 server, Path -> /data1/Go_Servers/goserver1/ & /data1/Go_Servers/goserver2/

      Observation:

    • If the count is zero, no users were present in the list/segment.
    • To verify this for a list, log in to MySQL and run the query below, replacing <listid> with the list ID:
      select * from phplist_listuser where listid='<listid>' order by entered asc;
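
    To go from the count line to the complete logs, the ReqID and the user count can be pulled out of the zgrep output. The sketch below is illustrative only: the function name extract_reqid_and_count is hypothetical, and the pattern assumes every count line has the same layout as the sample output shown in this step.

```shell
#!/bin/sh
# Hypothetical helper: print "<ReqID> <count>" for each GoGetDataset count line on stdin.
# Assumes the layout of the sample line:
#   ... ReqID : <uuid> Total User Count is : <n> for Cid : <clientid> - Autoid : <autoid>
extract_reqid_and_count() {
  sed -n 's/.*ReqID : \([0-9a-fA-F-]*\) Total User Count is : \([0-9][0-9]*\) for Cid.*/\1 \2/p'
}
```

    Usage: zgrep "Cid : <clientid> - Autoid : <autoid>" goserverapi.log* | extract_reqid_and_count, then zgrep the printed ReqID in the same files to see the complete logs. A count of 0 in the second column corresponds to the empty list/segment case described in the observation.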

  5. Step 5: Analyze Subsapi Logs:

    • Subsapi stores and updates the status of a journey and the corresponding journey details.
    • In rare cases, a journey is stopped before its deployment has completed.
    • Subsapi logs show exactly when the journey was deployed and when it was stopped.

  6. Step 6: Analyze Storm Logs:

    • Storm logs contain journey logs and show if the communication was triggered based on dataset conditions and flow details.
    • Storm topology log path: /var/log/apps/storm2x
    • Logs older than 2 days are under a dated backup path, for example: /var/log/apps/backup/storm2x/20220910

    • In the commands below, use the cg3 file prefix if the client ID ends in 3.
    • Example command for US IDC:
      ls -larth cg3* | awk '{print $NF}' | xargs zgrep "<user_email>" | grep 'autoid <autoid>'
    • Example command for Indian IDC:
      ls -larth cg3* | awk '{print $NF}' | xargs zgrep "<user_email>" | grep 'autoid <autoid>'
    • Command for the backup path (change the date to the day being investigated):

      ls -larth 20221008*/cg3* | awk '{print $NF}' | xargs zgrep "<user_email>" | grep 'autoid <autoid>'



    • Conclusion: If the Storm logs show an error saying that the user is blacklisted, or that the user was skipped due to an attribute check, share these findings with the client. If the Storm logs are inconclusive, escalate to SRE for further analysis.
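
    The Storm log searches in Step 6 can be wrapped in one function so that the same call works for both the live and the backup directories. This is a sketch under assumptions, not the team's standard tooling: the name storm_search is hypothetical, it swaps the ls/awk file listing for find, and it adds grep -w so that 'autoid 7' does not also match 'autoid 70'.

```shell
#!/bin/sh
# Hypothetical wrapper around the Step 6 search.
# Usage: storm_search <log_dir> <file_prefix> <user_email> <autoid>
storm_search() {
  # $1 = directory holding the Storm logs, $2 = file prefix such as cg3
  find "$1" -maxdepth 1 -type f -name "${2}*" -print0 |
    # -F treats the email as a fixed string, so "." is not a regex wildcard.
    xargs -0 -r zgrep -F -e "$3" |
    # -w requires word boundaries, so autoid 7 does not match autoid 70.
    grep -w -e "autoid ${4}"
}
```

    Usage: storm_search /var/log/apps/storm2x cg3 "<user_email>" <autoid> for recent logs, or point the first argument at a dated backup directory such as /var/log/apps/backup/storm2x/20220910 for older ones.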

Resolution / Next Steps:

  1. [Reason 1]:

    • Inform the client of the findings.
    • Provide clarity on why the problem occurred.
    • Provide a screenshot.
  2. [Reason 2]:

    • Escalate to Cross-functional teams to investigate further.
    • Provide findings by sharing relevant details such as the problem statement, logs, screenshots, and reference cases.

Escalation Path:

  1. Initial Verification by Business Teams (CRS/CSM/CSS/Onboarding):

    • If resolution is not found after initial verification:
  2. Further Investigation by Helpdesk:

    • If the issue is not resolved after initial & further investigation:
      • CAT Escalation:

        • Escalate to: SME
        • Contact: SME1 / SME2
        • Email: Ticket with internal note and investigation
        • Slack Channel: [Your respective channel within Support]
      • Cross-functional Escalation:

        • Escalate to: Cross-functional Teams
        • Contact: SRE
        • Email: Child ticket to the appropriate team
        • Slack Channel (Only for escalation): [Your respective channel with Cross Functions]

Reference Ticket IDs:


Feedback on Playbook:

Please provide your feedback on the playbooks using the following link: Feedback Form