"you should be more aggressive with patching your Exadata cell & Database software as there are lots of bugs out there"
This was an intro to Exadata troubleshooting. Note that I started from measuring what matters, and drilled down from there:
- I measured the response time of my "business task" and noticed the difference when selecting 256th column
- I then broke down the response time of the session running my business task using Oracle wait interface and noticed the unexpected waits for cell single block reads while smart scan was in use
- I gathered additional evidence related to the cell single block reads wait event using Snapper - which in this case was the table fetch continued row statistic which happens
- Additionally I was aware of the related details like execution plan which didn't use indexes and the schema physical design (I actually described that fact table and noticed that it had over 255 columns etc)
This was enough to conclude what was the cause for the performance problem and find the bug causing it.
Note that I did not start from measuring "advanced" things like cell CPU or interconnect utilization, querying V$CELL_% views or running CELLCLI stats commands. The performance troubleshooting work should be done from top down also on Exadata and in this case I didn't even have to dig deep into the smart storage stack. Also, Kudos to the Exadata developer or architect who decided to make the storage cells send some of their performance stats back to the database layer during smart scan execution - so that much of the troubleshooting can be done using the top down approach like demonstrated above.