1. Finding the row with the minimum value in pandas:
lowest_income_county = income["county"][income["median_income"].idxmin()]  # idxmin() returns the index label of the row with the smallest median_income
high_pop_county = income[income["pop_over_25"] > 500000]
# Among the counties with more than 500,000 residents, find the one with the lowest median income
lowest_income_high_pop_county = high_pop_county["county"][high_pop_county["median_income"].idxmin()]
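2. Random seeds: random.seed() fixes the generator's starting state, so the sequence of values drawn after it is the same on every run. Re-seed with the same value to repeat the sequence:
import random
random.seed(20)  # set the random seed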
new_sequence = [random.randint(0,10) for _ in range(10)]
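To check that a seed really makes the sequence reproducible, here is a minimal sketch that seeds the generator twice and compares the two lists. The helper name draw_sequence is made up for this illustration.

```python
import random

def draw_sequence(seed, n=10):
    """Seed the generator, then draw n integers from 0 to 10 (inclusive)."""
    random.seed(seed)
    return [random.randint(0, 10) for _ in range(n)]

first = draw_sequence(20)
second = draw_sequence(20)  # same seed, so the same sequence again
print(first == second)      # True
```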
3. To draw a fixed number of samples from a data set:
shopping_sample = random.sample(shopping, 4)  # pick 4 items from the list shopping, without replacement
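random.sample draws without replacement, so no position in the list is picked twice and the list itself is left unchanged. A small sketch, assuming some hypothetical contents for shopping (the real list is not shown in these notes):

```python
import random

# Hypothetical contents; the original list `shopping` is not shown in these notes.
shopping = ["milk", "cheese", "eggs", "bread", "apples", "rice"]

random.seed(1)
shopping_sample = random.sample(shopping, 4)

print(shopping_sample)
print(len(shopping))  # still 6: sampling does not modify the list
```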
4. Roll a die 10 times (values 1 to 6) and plot the results as a histogram with 6 bins:
import matplotlib.pyplot as plt
# helper that returns a random integer from 1 to 6, inclusive
def roll():
    return random.randint(1, 6)
random.seed(1)
small_sample = [roll() for _ in range(10)]
plt.hist(small_sample, 6)
plt.show()
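The bars of that histogram are simply counts per face. As a plotting-free sketch of the same tally, collections.Counter can be used. This is an alternative shown for illustration, not the method used in these notes.

```python
import random
from collections import Counter

def roll():
    return random.randint(1, 6)

random.seed(1)
small_sample = [roll() for _ in range(10)]

counts = Counter(small_sample)
for face in range(1, 7):
    print(face, counts[face])  # a face that never came up prints 0
```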
5. Roll the die 50 times, record the proportion of ones, and repeat the experiment 300 times:
def probability_of_one(num_trials, num_rolls):
    probabilities = []
    for i in range(num_trials):
        die_rolls = [roll() for _ in range(num_rolls)]
        # fraction of rolls in this trial that came up 1
        one_prob = len([d for d in die_rolls if d == 1]) / num_rolls
        probabilities.append(one_prob)
    return probabilities
random.seed(1)
small_sample = probability_of_one(300, 50)
plt.hist(small_sample, 20)
plt.show()
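The estimates cluster around 1/6, and the histogram gets narrower as the number of rolls per trial grows. A sketch that measures this with the standard deviation instead of a plot, using 10 and 500 rolls per trial (both values chosen only for illustration):

```python
import random
import statistics

def roll():
    return random.randint(1, 6)

def probability_of_one(num_trials, num_rolls):
    """Proportion of ones in num_rolls rolls, repeated num_trials times."""
    probabilities = []
    for _ in range(num_trials):
        die_rolls = [roll() for _ in range(num_rolls)]
        probabilities.append(die_rolls.count(1) / num_rolls)
    return probabilities

random.seed(1)
few_rolls = probability_of_one(300, 10)
many_rolls = probability_of_one(300, 500)

# Standard deviation of the estimates: smaller means a narrower histogram.
spread_few = statistics.pstdev(few_rolls)
spread_many = statistics.pstdev(many_rolls)
print(spread_few, spread_many)
```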
6. Random sampling gives a more representative picture than taking consecutive slices of the data:
mean_median_income = income["median_income"].mean()
print(mean_median_income)
def get_sample_mean(start, end):
    return income["median_income"][start:end].mean()
def find_mean_incomes(row_step):
    mean_median_sample_incomes = []
    for i in range(0, income.shape[0], row_step):
        # mean of consecutive blocks: rows 0-99, 100-199, 200-299, ... when row_step is 100
        mean_median_sample_incomes.append(get_sample_mean(i, i+row_step))
    return mean_median_sample_incomes
nonrandom_sample = find_mean_incomes(100)
plt.hist(nonrandom_sample, 20)
plt.show()
def select_random_sample(count):
    random_indices = random.sample(range(0, income.shape[0]), count)
    return income.iloc[random_indices]
random.seed(1)
random_sample = [select_random_sample(100)["median_income"].mean() for _ in range(1000)]  # mean of 100 randomly chosen rows, repeated 1000 times
plt.hist(random_sample, 20)
plt.show()
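The difference between the two histograms can also be shown without the income file. The sketch below uses a synthetic, sorted list as a stand-in for income["median_income"]. That is an assumption made for illustration: the gap is largest when row order is related to the value, and these notes do not say how the real file is ordered.

```python
import random
import statistics

# Synthetic, sorted stand-in for income["median_income"] (an assumption for illustration).
values = list(range(1000))

# Non-random: means of consecutive blocks of 100 rows.
block_means = [statistics.mean(values[i:i + 100]) for i in range(0, len(values), 100)]

# Random: means of 100 rows drawn at random, repeated 1000 times.
random.seed(1)
random_means = [statistics.mean(random.sample(values, 100)) for _ in range(1000)]

# The block means are spread far more widely than the random-sample means.
print(statistics.pstdev(block_means))
print(statistics.pstdev(random_means))
```

On data like this, the random-sample means stay close to the overall mean, while each block mean only describes its own slice.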
7. To compute a quantity from several columns of each sample (here, the ratio of two income columns), proceed like this:
# return "count" randomly chosen rows from the data set
def select_random_sample(count):
    random_indices = random.sample(range(0, income.shape[0]), count)
    return income.iloc[random_indices]
random.seed(1)
mean_ratios = []
for i in range(1000):  # repeat 1000 times
    sample = select_random_sample(100)
    ratio = sample["median_income_hs"] / sample["median_income_college"]
    mean_ratios.append(ratio.mean())  # mean of the row-wise ratio of the two columns, appended to the result list
plt.hist(mean_ratios, 20)
plt.show()
8. Statistical significance: a way to judge whether a result reflects a real effect in the population or could have arisen by chance:
significance_value = None
count = 0
for i in mean_ratios:
    if i > 0.675:  # 0.675 comes from another data set
        count += 1
# Fraction of the random-sample mean ratios that exceed 0.675. The result is 0.014: only 1.4% of
# the sampled county mean ratios are higher than the one measured from the salary data after the
# program, so a ratio that high is unlikely to arise by chance, which suggests the program worked.
significance_value = count / len(mean_ratios)
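The counting loop above can be condensed into a small helper. This is a sketch only: the name fraction_above and the toy ratios are made up to show the arithmetic and are not taken from the income data.

```python
def fraction_above(values, threshold):
    """Share of values strictly greater than threshold."""
    return sum(1 for v in values if v > threshold) / len(values)

# Made-up ratios, only to show the calculation.
toy_ratios = [0.50, 0.60, 0.70, 0.80]
print(fraction_above(toy_ratios, 0.675))  # 0.5, since 2 of the 4 values exceed 0.675
```

The smaller this fraction, the less likely it is that a value as large as the threshold would appear in a random sample by chance.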
Statistics and Linear Algebra 5
Original: http://www.cnblogs.com/kingoscar/p/6127957.html