Peer to peer (P2P) databases is getting to be prevalent on the Internet for distribution and sharing of documents, applications and other digital media. The issue of answering large-scale ad hoc analysis queries, for example, aggregation queries, on these databases possess unique challenges. Exact solutions can be time-consuming and difficult to implement, given the distributed and dynamic nature of P2P databases. We display novel sampling-based strategies for approximate answering of ad hoc aggregation queries in such databases.
Computing a high quality random sample of the database efficiently in the P2P environment is complicated because of a few factors: the data is distributed(usually in uneven quantities)across many peers, within each peer, the data is often highly correlated and, moreover, even collecting a random sample of the peers is difficult to accomplish. To counter these problems, we have developed an adaptive two-phase sampling approach based on random walks of the P2P graph, as well as block-level sampling techniques. We present extensive experimental evaluations to demonstrate the feasibility of our proposed solution.