While the cloud providers themselves might like us to fixate on more dramatic 43% cuts in outbound bandwidth costs or 30% reductions in the price of the smallest virtual machines, real workloads combine a number of moving parts and don't really see such dramatic reductions in the monthly bill. What is noteworthy is that the 2% figure represents 451 Research's attempt to model price across an entire portfolio of services, including compute, storage, networking and higher-level services. More interesting than the gradual fall in the on-demand price (paid by customers who turn up with a credit card and simply rent some cloud-based resources) is the more dramatic 12% reduction in the 'best-case' price paid by those who negotiate (aggressively) or commit to longer-term or sustained-use conditions. In some circumstances, these customers see a 44% saving over the on-demand price, which is good for them.

It's also good, of course, for the cloud providers. The economics of their business depend upon high levels of utilization of their hardware: AWS, Google, Microsoft and the rest don't make money when servers in their multi-billion-dollar data centers stand idle. Why are service providers cutting best-case pricing? Commitment goes a long way: capital, recurring revenue, financial lock-in and improved cash flow are all good news from the corporate angle. Advance notification means better capacity planning, more accurate (and cheaper) purchase of hardware and infrastructure, and even the potential for better 'bin-packing' of capacity – all of this results in a better-used infrastructure, reducing sunk cost and maximizing cheap unit prices.

How do you query hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers? Where do you store all that data, and how do you do anything useful with it? If you think this sounds like the perfect battleground for a head-to-head skirmish in the great MapReduce versus Database war, you would be correct. The future came a little early this year. Bill Boebel, CTO of Mailtrust (Rackspace's mail division), has generously provided a fascinating account of how they evolved their log processing system: from an early amoeba-like approach of text files stored on each machine, through a Neanderthal relational-database solution that just couldn't compete, to a Homo sapiens Hadoop-based solution that works wisely for them and has virtually unlimited scalability potential.

In the first version of their system, logs were stored in flat text files and had to be searched manually by engineers logging into each individual machine. Then came a scripted version of the same process. The next big evolution was a single-machine MySQL version. Inserts quickly became the bottleneck, as the huge torrents of data flooding in caused a lot of index churn. Periodic bulk loading was the remedy to this problem, but the sheer size of the indexes slowed it down. Data was then broken into merge tables based on time so index updates weren't a problem. As more and more data arrived, this solution broke down under a combination of load and operational problems. Moving to a partitioned MySQL data set was an option, but they thought it would only buy time; a more scalable solution would need to be created in the future anyway. Facing exponential growth, they spent about three months building a new log processing system using Hadoop (an open-source implementation of the Google File System and MapReduce), Lucene and Solr.

The advantage of the new system is that they can now look at their data in any way they want. Nightly MapReduce jobs collect statistics about their mail system, such as spam counts by domain, bytes transferred and number of logins. When they wanted to find out which part of the world their customers logged in from, a quick MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system.
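To make the nightly statistics jobs concrete, here is a minimal sketch of the kind of MapReduce computation described: counting logins per domain from mail logs. This is not Mailtrust's actual code, and the log format below is invented for illustration; it mimics the map and reduce phases in plain Python rather than running on a real Hadoop cluster.

```python
from collections import defaultdict

# Hypothetical mail-log lines; the real Mailtrust log format is not
# described in the article, so this format is invented for illustration.
SAMPLE_LOGS = [
    "2007-11-04 user=alice@example.com event=login bytes=2048",
    "2007-11-04 user=bob@example.org event=login bytes=512",
    "2007-11-04 user=carol@example.com event=spam bytes=0",
    "2007-11-04 user=dave@example.com event=login bytes=1024",
]

def mapper(line):
    """Map phase: emit a (domain, 1) pair for each login event."""
    fields = dict(f.split("=", 1) for f in line.split()[1:])
    if fields.get("event") == "login":
        domain = fields["user"].split("@", 1)[1]
        yield domain, 1

def reducer(pairs):
    """Reduce phase: sum the emitted counts for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

def logins_by_domain(lines):
    """Chain the two phases over an iterable of log lines."""
    return reducer(kv for line in lines for kv in mapper(line))
```

On a real cluster the mapper and reducer would run as separate distributed tasks (for example via Hadoop Streaming, with tab-separated key/value pairs on stdin/stdout), but the shape of the computation is the same.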
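As a back-of-the-envelope illustration of the pricing gap discussed above, the saving of a committed or negotiated rate over the on-demand rate is a simple percentage. The hourly rates below are invented, not 451 Research's data; they are merely chosen so the result lands on the 44% best-case figure quoted in the text.

```python
def best_case_saving(on_demand_hourly, committed_hourly):
    """Percentage saving of a committed/negotiated rate versus the
    on-demand rate for the same resource."""
    return 100.0 * (on_demand_hourly - committed_hourly) / on_demand_hourly

# Hypothetical rates for one virtual machine (illustrative only).
on_demand = 0.10   # $/hour, walk-up credit-card price
committed = 0.056  # $/hour after aggressive negotiation or commitment

saving = best_case_saving(on_demand, committed)  # roughly 44%
```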