If you have read any of my previous blog posts, you may have seen that I have already discussed data analysis in depth. If you are unfamiliar with data analysis or have forgotten what it is, it is simply the process of analyzing and modeling data to extract useful information and draw conclusions. Researchers in computer science at MIT have created a new set of algorithms that "can efficiently fit probability distributions to high-dimensional data" (MIT, 2016). This is helpful because many of the apps and websites we use every day rely on high-dimensional data, and knowing how to handle corruption in that data, if it happens to occur, goes a long way toward making our lives easier.
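To make the basic task concrete, here is a minimal sketch of what "fitting a probability distribution to high-dimensional data" usually means in the standard (non-robust) setting: estimating a Gaussian's mean and covariance from samples. The dimensions, sample count, and true parameters below are made-up illustrative values, and this is the textbook estimator, not the MIT algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 1,000 samples from a hypothetical 10-dimensional Gaussian.
true_mean = np.zeros(10)
true_cov = np.eye(10)
samples = rng.multivariate_normal(true_mean, true_cov, size=1000)

# The standard fit: sample mean and sample covariance.
fitted_mean = samples.mean(axis=0)
fitted_cov = np.cov(samples, rowvar=False)

print("error in fitted mean:", np.linalg.norm(fitted_mean - true_mean))
```

With clean data this works well; the trouble described below starts when some of the samples are corrupted.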
The problem is that if the data contains corrupted entries, the standard data-fitting technique can break down, producing a model that no longer reflects the real data. In high-dimensional data, with an immense number of data points, any corruption is much harder to detect and correct. The researchers at MIT observed that using the median to estimate the mean of the data is far less sensitive to corrupted data points than using the average, and they took this into consideration when designing their algorithm.
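Here is a quick sketch of why the median holds up under corruption while the average does not. The sample sizes and the corrupted value are hypothetical, chosen only to make the effect obvious:

```python
import numpy as np

rng = np.random.default_rng(1)

# 1,000 clean samples centered at 5, plus 50 corrupted entries far away.
clean = rng.normal(loc=5.0, scale=1.0, size=1000)
corrupted = np.concatenate([clean, np.full(50, 1000.0)])

# The average gets dragged far from the true center by the corrupted
# points, while the median barely moves.
print("mean:  ", corrupted.mean())      # roughly 52, badly skewed
print("median:", np.median(corrupted))  # still close to 5
```

Only about 5% of the data is corrupted here, yet the average is off by an order of magnitude; the median is essentially unaffected.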
Computer scientists commonly examine 2-D cross sections of the graph of the data to test whether they look like "Gaussian distributions". A Gaussian distribution is a continuous, bell-shaped distribution that, among other things, closely approximates the binomial distribution of events. Data whose cross sections do not look Gaussian likely has corruption within it. The researchers combined the Gaussian with another common class of distributions, called "product distributions", to create an algorithm with efficiency and real-world applicability as its central focus.
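Here is a rough sketch of the idea behind inspecting 2-D cross sections: project the high-dimensional data onto a random plane and check whether the projection still looks Gaussian. I am using scipy's D'Agostino-Pearson normality test as a stand-in for the visual inspection; the dimensions, sample counts, and corrupted values are all hypothetical, and this is not the MIT algorithm itself:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Clean 10-dimensional Gaussian data with some corrupted rows appended.
clean = rng.multivariate_normal(np.zeros(10), np.eye(10), size=1000)
bad = np.full((50, 10), 8.0)  # corrupted points, far from the bulk
data = np.vstack([clean, bad])

# Project onto a random 2-D cross section.
plane = rng.normal(size=(10, 2))
projected = data @ plane

# Test each of the two projected coordinates for normality.
for i in range(2):
    stat, p = stats.normaltest(projected[:, i])
    print(f"axis {i}: p-value = {p:.4f}")  # tiny p suggests non-Gaussian
```

Running this on the clean data alone gives unremarkable p-values, while the corrupted version fails the test, which is exactly the signal that something in the data is off.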
References:
MIT EECS (2016). Finding patterns in corrupted data. https://www.eecs.mit.edu/news-events/media/finding-patterns-corrupted-data
Wikipedia. Data analysis. https://en.wikipedia.org/wiki/Data_analysis

I learned about the Gaussian distribution in AP Statistics last year, but I didn't know that normal curves could also be used to detect data corruption. Yeah, data analysis with large data sets would be difficult and error-prone. An article I wrote just this week is about how Hadoop can be used to handle big data, so feel free to check it out!