I think when talking about reliable (trustworthy) evidence, you need to be very careful about whether all the right tests are being applied to provide that evidence. Not just the right tests, but all the right tests ... omissions can be disastrously misleading.
I was once told a story (no idea of its truth but it illustrated the point) of sea trials being done on a new ship-board missile system. One of the missiles misfired, so the ship turned around to return to base ... and the missile exploded in its launch tube. Apparently the missile had (as I presume they all do these days) a new safety mechanism (flawed in this case!) whereby after launch, if the missile went out of control and headed back towards the ship, it self-destructed. Unfortunately in this case the system failed to realise the missile had not left the ship in the first place.
Although this sea trial did rather spectacularly prove the system was not safe, the preceding trials should have caught the flaw before it ever got that far. Those earlier trials must have produced what looked like reliable evidence that it was safe to move on to sea trials, yet at least one crucial aspect of testing was overlooked. What seemed like reliable evidence was in fact deeply unreliable, because the test specification had a huge hole in it.
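Purely as a hypothetical caricature (I know nothing of the real system, and every name here is invented), the kind of flaw the story describes can be sketched as a guard condition that implicitly assumes a precondition — that launch has already happened — which no test case ever challenged:

```python
from dataclasses import dataclass

@dataclass
class MissileState:
    launched: bool         # has the missile actually left the tube?
    heading_to_ship: bool  # does guidance think it is tracking back towards the ship?
    armed: bool

def should_self_destruct(state: MissileState) -> bool:
    # Flawed logic: checks only the heading, implicitly assuming that
    # any armed missile must already be clear of the ship.
    return state.armed and state.heading_to_ship

def should_self_destruct_fixed(state: MissileState) -> bool:
    # The missing precondition: never detonate unless the missile
    # has actually left the launch tube.
    return state.armed and state.launched and state.heading_to_ship

# The overlooked test case: armed, misfired, still in the tube,
# with the ship's turn making guidance read "heading back at us".
misfire = MissileState(launched=False, heading_to_ship=True, armed=True)
print(should_self_destruct(misfire))        # True  -- detonates in the tube
print(should_self_destruct_fixed(misfire))  # False -- safe
```

The point is not the fix itself but the test case: a suite full of "missile in flight" scenarios would pass both versions, and only someone asking "what states are we *not* testing?" would ever write the misfire case.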
It is usually quite easy to specify a raft of tests that need doing to "prove" something, but invariably much harder to identify which crucial tests are still missing. Quality peer reviewing typically plays its part here, because spotting what is not there often benefits from lateral thinking and collaborative discussion.