2016-10-21
There are 32,000+ datasets at NASA, and NASA is interested in understanding the connections between these datasets and also connections to other important datasets at government organizations outside of NASA. Metadata about the NASA datasets is available online in JSON format. Let’s look at this metadata, specifically the description and keyword fields in this report. Let’s use topic modeling to classify the description fields and connect that to the keywords.
Topic modeling is a method for unsupervised classification of documents; this method models each document as a mixture of topics and each topic as a mixture of words. The kind of method I’ll be using here for topic modeling is called latent Dirichlet allocation (LDA), but there are other possibilities for fitting a topic model. In the context here, each dataset description is a document; we are going to see if we can model these description texts as a mixture of topics.
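For reference, the "mixture of topics, mixture of words" idea can be written as the usual LDA generative model. This is a sketch in standard textbook notation; the symbols α, η, θ, and z are my notation choices and do not come from the code in this post. Here K is the number of topics, D the number of documents, and w_{d,n} the n-th word of document d.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{align*}
```

Each topic k has its own distribution over words, each document d has its own distribution over topics, and every word is generated by first picking a topic and then picking a word from that topic.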
Let’s download the metadata for the 32,000+ NASA datasets and set up data frames for the descriptions and keywords, similarly to my last exploration.
Just to check on things, what are the most common keywords?
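The analysis in this post was done in R, and that code is not reproduced here. As a rough illustration of the keyword count, here is a minimal Python sketch on toy records; the record layout and the keyword values are assumptions made up for the example, not the real NASA metadata.

```python
from collections import Counter

# Hypothetical toy records standing in for the NASA metadata;
# the ids and keyword values are invented for illustration.
datasets = [
    {"id": "d1", "keyword": ["EARTH SCIENCE", "OCEANS"]},
    {"id": "d2", "keyword": ["EARTH SCIENCE", "ATMOSPHERE"]},
    {"id": "d3", "keyword": ["PROJECT", "COMPLETED"]},
]

# One count per (dataset, keyword) pair, summed over all datasets.
keyword_counts = Counter(kw for d in datasets for kw in d["keyword"])

# The two most frequent keywords with their counts.
top_keywords = keyword_counts.most_common(2)
print(top_keywords)
```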
To do the topic modeling, we need to make a DocumentTermMatrix, a special kind of matrix from the tm package (of course, there is just a general concept of a “document-term matrix”). Rows correspond to documents (description texts in our case) and columns correspond to terms (i.e., words); it is a sparse matrix and the values are word counts (although they also can be tf-idf).
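To make the general concept concrete, here is a small Python sketch of a sparse document-term matrix built by hand. It is not the tm package's DocumentTermMatrix; the two toy descriptions and the helper name `count` are assumptions for illustration.

```python
from collections import Counter

# Two toy description texts (invented for illustration).
docs = {
    "d1": "soil moisture data from land surface model",
    "d2": "ocean temperature data and ocean salinity data",
}

# Vocabulary: one column per distinct term, in sorted order.
vocab = sorted({w for text in docs.values() for w in text.split()})
col = {term: j for j, term in enumerate(vocab)}

# Sparse storage: only nonzero word counts are kept, keyed by (row, column).
sparse_counts = {}
for i, text in enumerate(docs.values()):
    for term, n in Counter(text.split()).items():
        sparse_counts[(i, col[term])] = n

def count(doc_index, term):
    """Word count for one document row and one term column (0 if absent)."""
    return sparse_counts.get((doc_index, col[term]), 0)

print(count(1, "data"), count(0, "ocean"))
```

Most (row, column) pairs never appear in `sparse_counts`, which is the point of a sparse representation: with tens of thousands of descriptions and a large vocabulary, almost every cell is zero.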
Let’s clean up the text a bit using stop words to remove some of the nonsense “words” leftover from HTML or other character encoding.
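As a sketch of this cleaning step in Python: the stop word sets below are short, hypothetical stand-ins (the actual list used for this post is not shown), and the sample text is invented to include the kind of HTML residue described above.

```python
import re

# A few ordinary English stop words (a real list would be much longer).
stop_words = {"the", "of", "and", "a", "in", "for", "from"}

# Hypothetical leftover "words" from HTML entities or encoding problems.
custom_stop_words = {"nbsp", "amp", "gt", "lt", "quot"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def clean(text):
    """Tokenize, then drop both ordinary and custom stop words."""
    drop = stop_words | custom_stop_words
    return [w for w in tokenize(text) if w not in drop]

tokens = clean("Soil moisture &amp;nbsp; data from the land&gt;surface model")
print(tokens)
```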
Now let’s make the DocumentTermMatrix.
Now let’s use the topicmodels package to create an LDA model. How many topics will we tell the algorithm to make? This question is much like choosing k in k-means clustering; we don’t really know ahead of time. We can try a few different values and see how the model is doing in fitting our text. Let’s start with 8 topics.
We have done it! We have modeled topics! This is a stochastic algorithm that could have different results depending on where the algorithm starts, so I need to set a seed for reproducibility. We’ll need to see how robust the topic modeling is eventually.
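To show why the seed matters, here is a compact Python sketch of LDA fitted by collapsed Gibbs sampling. This is a swapped-in teaching implementation, not the topicmodels package and not necessarily the estimation method used for this post; the function name `fit_lda`, the hyperparameter defaults, and the toy documents are all assumptions.

```python
import random

def fit_lda(docs, k, iterations=200, alpha=0.1, eta=0.01, seed=1234):
    """Collapsed Gibbs sampler for LDA; returns (vocab, beta, gamma)."""
    rng = random.Random(seed)  # fixed seed: same start, same draws, same result
    vocab = sorted({w for d in docs for w in d})
    index = {w: j for j, w in enumerate(vocab)}
    v = len(vocab)
    doc_topic = [[0] * k for _ in docs]       # tokens of doc d assigned to topic t
    topic_word = [[0] * v for _ in range(k)]  # times term j is assigned to topic t
    topic_total = [0] * k                     # total tokens assigned to topic t
    assign = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            doc_topic[d][t] += 1
            topic_word[t][index[w]] += 1
            topic_total[t] += 1
        assign.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                j, t = index[w], assign[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][j] -= 1
                topic_total[t] -= 1
                # Conditional weight of each topic for this token.
                weights = [
                    (doc_topic[d][s] + alpha)
                    * (topic_word[s][j] + eta) / (topic_total[s] + v * eta)
                    for s in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assign[d][n] = t
                doc_topic[d][t] += 1
                topic_word[t][j] += 1
                topic_total[t] += 1
    # beta: per-topic term probabilities; gamma: per-document topic probabilities.
    beta = [[(topic_word[t][j] + eta) / (topic_total[t] + v * eta)
             for j in range(v)] for t in range(k)]
    gamma = [[(doc_topic[d][t] + alpha) / (len(doc) + k * alpha)
              for t in range(k)] for d, doc in enumerate(docs)]
    return vocab, beta, gamma

# Toy tokenized documents (invented for illustration).
docs = [
    "soil land vegetation soil moisture".split(),
    "land soil carbon vegetation".split(),
    "system design technology system".split(),
    "technology design software system".split(),
]
vocab, beta, gamma = fit_lda(docs, k=2)
again = fit_lda(docs, k=2)
print(again == (vocab, beta, gamma))
```

Calling `fit_lda` twice with the same seed gives identical output, while a different seed may label or split the topics differently. That is the robustness question raised above.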
Let’s use the amazing/wonderful broom package to tidy the models, and see what we can find out.
The column β tells us the probability of that term being generated from that topic. Notice that some are very, very low, and some are not so low.
What are the top 5 terms for each topic?
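Picking the top terms is a group-then-sort operation on the tidy (topic, term, β) table. A minimal Python sketch, assuming that three-column layout; the rows and β values are made up for illustration and are not results from the NASA model.

```python
# Hypothetical tidy rows of (topic, term, beta); values are invented.
rows = [
    (1, "data", 0.040), (1, "soil", 0.031), (1, "land", 0.022), (1, "system", 0.001),
    (2, "data", 0.035), (2, "system", 0.030), (2, "design", 0.025), (2, "soil", 0.002),
]

def top_terms(rows, n):
    """Return the n highest-beta terms for each topic."""
    by_topic = {}
    for topic, term, beta in rows:
        by_topic.setdefault(topic, []).append((beta, term))
    return {
        topic: [term for _, term in sorted(pairs, reverse=True)[:n]]
        for topic, pairs in by_topic.items()
    }

top = top_terms(rows, 2)
print(top)
```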
Let’s look at this visually.
We can see what a dominant word “data” is in these description texts. There do appear to be meaningful differences between these collections of terms, though, from terms about soil and land to terms about design, systems, and technology. Further exploration is definitely needed to find the right number of topics and to do a better job here. Also, could the title and description words be combined for topic modeling?
Let’s find out which topics are associated with which description fields (i.e., documents).
The column γ here is the probability that each document belongs in each topic. Notice that some are very low and some are higher. How are the probabilities distributed?
The y-axis is plotted here on a log scale so we can see something. Most documents are getting sorted into one of these topics with decent probability; lots of documents are getting sorted into topic 2, and documents are being sorted into topics 1 and 5 (6?) less cleanly. Some topics have fewer documents. For any individual document, we could find the topic that it has the highest probability of belonging to.
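Finding the most likely topic for each document is a per-document maximum over γ. A small Python sketch, assuming a tidy (document, topic, γ) layout; the document ids and probabilities are invented for illustration.

```python
# Hypothetical tidy rows of (document, topic, gamma); values are invented.
gamma_rows = [
    ("d1", 1, 0.95), ("d1", 2, 0.05),
    ("d2", 1, 0.40), ("d2", 2, 0.60),
]

# For each document, keep the (topic, gamma) pair with the largest gamma.
best = {}
for doc, topic, g in gamma_rows:
    if doc not in best or g > best[doc][1]:
        best[doc] = (topic, g)

print(best)
```

In this toy data, d1 is assigned cleanly (0.95), while d2 is a much weaker assignment (0.60), which is the kind of difference the distribution of γ shows.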
Let’s connect these topic models with the keywords and see what happens. Let’s join this dataframe to the keywords and see which keywords are associated with which topic.
Let’s keep each document that was modeled as belonging to a topic with a probability > 0.9, and then find the top keywords for each topic.
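The filter-join-count step can be sketched in Python as follows. The table layouts, document ids, keywords, and the function name `top_keywords_by_topic` are assumptions for illustration; only the 0.9 threshold comes from the text above.

```python
from collections import Counter, defaultdict

# Hypothetical (document, topic, gamma) rows; values are invented.
gamma_rows = [
    ("d1", 1, 0.97), ("d1", 2, 0.03),
    ("d2", 1, 0.55), ("d2", 2, 0.45),
    ("d3", 1, 0.07), ("d3", 2, 0.93),
    ("d4", 1, 0.92), ("d4", 2, 0.08),
]

# Hypothetical keywords per document.
keywords = {
    "d1": ["EARTH SCIENCE", "LAND SURFACE"],
    "d2": ["ATMOSPHERE"],
    "d3": ["PROJECT", "COMPLETED"],
    "d4": ["EARTH SCIENCE", "OCEANS"],
}

def top_keywords_by_topic(gamma_rows, keywords, threshold=0.9, n=5):
    """Count keywords of documents assigned to a topic with gamma > threshold."""
    counts = defaultdict(Counter)
    for doc, topic, g in gamma_rows:
        if g > threshold:
            # Join: attach this document's keywords to its confident topic.
            counts[topic].update(keywords.get(doc, []))
    return {topic: c.most_common(n) for topic, c in counts.items()}

result = top_keywords_by_topic(gamma_rows, keywords)
print(result)
```

Document d2 never clears the threshold for either topic, so its keyword does not contribute to any topic's counts.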
Let’s do a visualization for these as well.
These are really interesting combinations of keywords. I am not confident in this particular number of topics, or how robust this modeling might be (not tested yet), but this looks very interesting and is a first step!
NASA Metadata: Topic Modeling of Description Texts
Original article: https://www.cnblogs.com/tecdat/p/12035890.html