Volume Scalability => the solution must handle high volumes of data, meaning the cost must scale linearly in the range of 10GB – 10PB.
Latency Scalability => the solution must be interactive or batch, and cost must scale linearly in the range of 1 msec – 1 week.
Sophistication Scalability => the solution must support simple summing scans or complex multi-way joins and statistics functionality and the cost must scale linearly in the range of simplistic scans to full blown SQL:2008/MDX/imperative in-database-analytics/MapReduce. Report/index viewing is not considered as analytics at all and particularly as not low-sophistication analytics. Report/index creation is analytics and can be of varied sophistication degree. ETL systems is considered as independent analytic systems.
Security => any unauthorized access to data must be prevented and in the same time, in-place data analysis (like predicate evaluation) must be possible and resource-efficient.
Keeping data always encryption and keeping keys always on client will not work. It will require shipping all the data to the client and is non-starter for big data analytics. So compromises must be made. The issue is especially contentious in public cloud setting.
If data is stored encrypted and is continuously decrypted in-place for predicate evaluation, for example, it means that keys must kept in same place (at least temporarily) and it compromises the whole scheme altogether, flooring its cost-benefit factor. The cost of decryption is pretty high.
De-identification of all fields may work; random scaling may be applied to numeric fields with subsequent query/result rewrite.
Security-by-obscurity methods and defense-in-depth approach may have good cost-benefit factors matching or exceeding overall security for in-house approach.
Cost => must have low-TCO that scales linearly to dataset size and the load factor caused by submitted queries. The breakdown (assuming cloud):
Storage component linear to dataset size. Economies of scale must bring this cost down significantly. Eventually it must be cheaper than on-site storage.
Computing component linear to load with infinite intra-query automatic elasticity. Guarantied elasticity may bear a fixed premium proportional to guarantied capacity. Minor failures of cloud component must not restart long running queries.
Bandwidth component. Fedexing hard-drives are by far the cheapest way to upload data, and then query results are really small. How much information human can comprehend instantly after all?
serialized objects / nested data.
bio / scientific
and other data forms must be equally well supported and cross-queried.