User:Mark W. Miller
Articles written or extensively rewritten
[edit]American Civil War
[edit]- Battle of Salineville (written before registering, editted by User:Scott Mingus)
As of June 2008 the article still largely reflects my contributions. I have always felt a little bad about mentioning events that surrounded the surrender.
Math and Statistics
[edit]- Matrix population models (written)
As of June 2008 the article still largely reflects my contributions, which wasn't much. The article remains only a stub.
Biometrics
[edit]- Mark and recapture (extensively rewritten)
As of June 2008 the article still largely reflects my contributions. Capture-recapture is an enormous field that presently involves some extremely complex models. However, the Lincoln-Petersen method is basically the beginning and the method is still used. The references at the bottom of the page will expose readers to the broader world of capture-recapture if they are interested. Quite a few good books are available on the subject. Some can be downloaded for free from the internet.
Sports
[edit]- Aaron Smith (football player) (extensively rewritten)
As of June 2008 the article still largely reflects my contributions. Smith's stats are up-to-date, as is his current situation with the Pittsburgh Steelers. Many other people have contributed to his page since I added to it.
Material below has been helpful to me and hopefully will be helpful to others
[edit]Newton-Raphson Method and Hessian Matrix
[edit]NOTE TO SELF: THERE WAS A TYPO IN THE HESSIAN MATRIX IN THE BOOK. THE INVERSE OF THE HESSIAN IS CORRECT, BUT THE HESSIAN ITSELF WAS NOT.
THE HESSIAN NOW IS CORRECT BELOW.
I posted a question recently about the Hessian Matrix on the Math Talk page and received some good replies. I posted a follow-up question later in the Math and Science Question section, but then found the error in the book myself. Below deals with using the Hessian Matrix and Newton-Raphson Method to locate the minima or maxima of a function. The example in REA's Problem Solvers book "Operations Research", p. 739-740, contains an error in the Hessian Matrix.
The function in the example contains two unknowns, x1 and x2, and is:
4x12 + 2x1x2 + 2x22 + x1 + x2.
The problem is to find the values of x1 and x2 that provide the minima.
The matrix of partial derivatives is: f(x) = = 0.
The Hessian Matrix is:
H(x) = ;
Not:
H(x) = as in the book.
The third matrix is indeed the inverse of the Hessian Matrix, and is given correctly in the book as:
H-1(x) = (1/14) * .
The interative formula used to find the values of x1 and x2 corresponding to the minima of the original function is:
x(n+1) = x(n) - H-1(x) * f(x);
and gives the values x(n+1)1 = -1/14 and x(n+1)2 = -3/14.
I understand, at least mechanically, the substitutions and algebra used to arrive at those answers given the formula for x(n+1).
Note that:
H(x) * H-1(x) = .
Note that the eigenvalues of the Hessian Matrix are 8.8284271 and 3.1715729, which I think corresponds to a minima, which is what the REA "Operations Research" book said was the case with this particular function.
Installed MiKTeX on an Windows XP computer. The MiKTeX download included LaTeX.
I opened a DOS window and went to the folder containing 'latex.exe':
C:\Program Files\MiKTex 2.7\mikTex\bin>
then typed 'latex hello':
C:\Program Files\MiKTex 2.7\mikTex\bin>latex hello
and it worked.
Mark W. Miller (talk) 06:57, 19 May 2009 (UTC)
The steps immediately above worked to create a file called 'hello.dvi'. I was able to convert 'hello.dvi' into a PDF file using the following two steps:
Step 1. Create a PostScript file:
C:\Program Files\MiKTex 2.7\mikTex\bin>dvips hello.dvi -o hello.ps
Step 2. Create a PDF file using the PostScript file from the first step:
C:\Program Files\MiKTex 2.7\mikTex\bin>ps2pdf hello.ps hello.pdf
Note that the file 'hello.tex' was also in the folder:
C:\Program Files\MiKTex 2.7\mikTex\bin>
as were the executable files 'dvips.exe', 'ps2pdf.exe' and 'latex.exe' which were installed automatically when I installed 'MiKTeX 2.7'.
The file 'hello.tex' was obtained via an internet search, copied, and pasted into MS Wordpad and saved on my computer in plain text format.
Mark W. Miller (talk) 08:17, 7 June 2008 (UTC)
Installed TeXnicCenter which serves as a shell or text editor or graphical user interface for LaTeX and created another version of 'hello.pdf' from within the TeXnicCenter application.
Mark W. Miller (talk) 12:36, 7 June 2008 (UTC)
Questions I Have Posted at the Mathematics Reference Desk
[edit]Below are some questions I have posted at the Mathematics and Computing Reference Desks. The answers were helpful and I have periodically gone back to refresh my memory.
1. Derivative and partial derivative of what I called the inverse of the logit function.
Note that the answer was vandalized, but I do not think the content of the question or answer was changed.
2. Estimating the variance-covariance matrix of a multinomial function
http://en.wikipedia.org/wiki/Wikipedia:Reference_desk/Archives/Mathematics/2007_September_28
Note that I answered my own question. I believe my answer is correct. Hopefully so. If somebody types "Estimating the variance-covariance matrix of a multinomial function" into Google, the first hit returned is my question and answer!
http://en.wikipedia.org/wiki/Wikipedia:Reference_desk/Archives/Mathematics/2007_September_22
This question dealt with how to determine whether a variable follows a Poisson Distribution.
4. Displaying the gradient when optimizing in R (programming language)
Note that this Question #4 was posted at the Computing Reference Desk, not the Mathematics Reference Desk. The question was not answered there.
Here is a possible solution I have not yet tried:
http://tolstoy.newcastle.edu.au/R/e6/help/09/02/5786.html
Also, I might try the spg() function. I think the optimization function nlm in R will show the gradient too.
Example of Double Integration with Program R
[edit]Program R allows you to perform double and triple integration. I had trouble locating example R code for double integration, so I provide an example here. The function is f(x,y) = x^2 * y; x ranges from -1 to 2 and y ranges from 0 to 2; and the answer is 6.
library(adapt)
fint <- function(p) {
x = p[1]
y = p[2]
x^2 * y }
adapt(ndim = 2, lower = c(-1,0) , upper = c(2,2), functn = fint)
Example of Triple Integral in Program R
[edit]R code for a triple integral taken from page 808 of Ellis and Gulick, second edition (1982). The answer is -54.
library(adapt)
fint <- function(p) {
x = p[1]
y = p[2]
z = p[3]
y - x * z }
adapt(ndim = 3, lower = c(2,-1,0) , upper = c(4,1,3), minpts=10000, functn = fint, eps = 1e-6)
Mark W. Miller (talk) 07:32, 19 May 2009 (UTC)
Reading an Excel 2007 file into Program R
[edit]The R code below reads (or imports) and displays data from an Excel 2007 file named 'test.xlsx' on the U drive.
library(RODBC)
channel <- odbcDriverConnect("DRIVER=Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb); DBQ=U:\\test.xlsx; ReadOnly=False")
sqlTables(channel)
my.data <- sqlFetch(channel, "Sheet1")
my.data
Mark W. Miller (talk) 00:27, 30 October 2009 (UTC)
Add Confidence Intervals to a Plot in Program R when you Have Three Columns of Numbers
[edit]The code below adds confidence intervals to a plot in Program R when you, in effect, enter the values of the confidence intervals by hand instead of having them generated by a linear model statement, for example.
The program uses a loop to plot the points and their confidence intervals one at a time. The arrow statement is used to add the confidence intervals and the point statement is used to add their point estimates.
I learned how to do this by dissecting the R code at the following site:
http://yihui.name/en/2007/10/demonstration-of-confidence-intervals-using-r/
I am not sure why there are dashed lines around sections of the code, but if you copy all of the code and paste it into Program R it should work. Then you can modify it to suit your purposes.
# the first two lines determine the size and position of the plot, default is a large box
# par(mar = c(4.5, 4, 2.5, 0.5)) # layout(matrix(1:2, 2), heights = c(0.6, 0.4))
n = 5 # number of points on the graph
m = c(2,3,4,5,6) # point estimates
y0.a = c(0.25, 0.5, 0.75, 1.0, 1.25) # value to subtract from m for lower CI y1.a = c(0.25, 0.5, 0.75, 1.0, 1.25) # value to add to m for upper CI
y0 = m - y0.a # lower 95% confidence interval y1 = m + y1.a # upper 95% confidence interval
# the plot statement below creates the plot border and labels
plot(1, xlim = c(0.5, n + 0.5), ylim = c(min(y0), max(y1)), type = "n", xlab = "Samples", ylab = "Measurement", main = "My Simple Title")
# the loop below adds the points and their CI bars one point at a time
for (i in 1:n) { arrows(i, y0[i], i, y1[i], length = 0.05, angle = 90, code = 3, col = "black")
points(i, m[i]) }
Mark W. Miller (talk) 01:49, 23 December 2009 (UTC)
View or Read Source Code in Program R
[edit]To see the source code for a function in Program R download the package containing the function. Specifically, download the file that ends in "tar.gz". This is a compressed file. Expand the compressed file using, for example, "WinZip". Now you need to open the uncompressed file that ends in ".tar". Download the free software "7-Zip". Click on the file "7zFM.exe" and navigate to the directory containing the ".tar" file. You can extract the contents of that ".tar" file into a new folder. The contents consist of R files showing the source code for the functions in the R package.
Mark W. Miller (talk) 18:53, 28 September 2010 (UTC)
Program R code for a Bayesian Analysis of Coin Flips
[edit]The Program R code below calls WinBUGS to estimate the probability of obtaining heads given that 17 heads have been observed in 20 flips. Note that WinBUGS returns a point estimate for p of 0.819 rather than 0.85. If the number of heads and flips is multiplied by 1000 to 1700 and 2000 then the point estimate of p will be much closer to 0.85. You can copy the code below and paste in directly into Program R and it should run if you also have WinBUGS installed and have installed the R2WinBUGS R package. You will probably have to change, or create, the directory I use in the sink statement and the bugs statement.
library("R2WinBUGS")
r <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0)
my.data <- list ( "r" )
my.params <- c( "p" )
my.inits <- function(){list( p = 0.85 )}
sink("c:/Bayesian coin flip/bayesian coin flip.txt")
cat("
model {
for(i in 1:20) {
r[i] ~ dbin(p,1)
}
p ~ dunif(0,1)
}
",fill=TRUE)
sink()
fit <- bugs (my.data, my.inits, my.params, "c:/Bayesian coin flip/bayesian coin flip.txt",
n.chains=3, n.iter=100000, n.burnin=50000, n.thin = 5, debug=TRUE )
print(fit, digits=5)
Mark W. Miller (talk) 07:51, 14 June 2011 (UTC)
C Code Input Output File or C Code Read Write File
[edit]C Code to Read a Data File, Perform a Calculation, and Write a Data File
The C code below reads an input data file, performs a simple calculation on the data, and writes the data and the result of the calculation to a new output data file. The C code also displays the data and the result of the calculation on the computer screen to help in debugging the code if an error should occur.
In this example the data file I used is called "mydata.dat" which consisted solely of a single column of numbers 1-20 (i.e., with no header and no delimiter). However, nothing in the C code below specifies that there are 20 observations in the data file. You can use any column of integers you wish (and as many integers in the column as you wish) in your input data file.
Note that I am not clear on how to format text for Wikipedia. The code below is correct. I am not sure why some lines below are displayed inside a box and some lines are not. I think the indented lines below are displayed inside a box.
Nevertheless, you should be able to copy the code below directly from the computer screen, paste it into a text file named, for example, "readwrite.c", and the code should compile to an .exe file using, for example, Jens' File Editor. The resulting .exe file should read, calculate, display and write when executed (i.e., when clicked on).
#include <stdio.h>
#include <math.h>
char quit;
main()
{
FILE *md, *mdout;
int i;
int ii ;
if (((md = fopen("mydata.dat" , "r")) == NULL) |
((mdout = fopen("myoutput.dat", "w")) == NULL)) printf("a file could not be opened\n");
else {
printf("%-10s%s\n", "I", "(I + 1000)"); fscanf(md, "%d", &i);
while (!feof(md))
{
ii = i + 1000 ;
printf("%-10d%d\n", i, ii);
fprintf(mdout, "%-10d%d\n", i, ii);
fscanf(md, "%d", &i);
}
}
printf("To close this program type 'quit' and hit the return key\n"); printf(" \n");
scanf("%d", &quit);
return 0;
}
Mark W. Miller (talk) 19:49, 19 June 2011 (UTC)
R Code For Multiplying a Vector and a Matrix
[edit]If you want to multiply a vector and a matrix you need to remember that 'two matrices can be multiplied only when the number of columns in the first equals the number of rows in the second' (quoted text from the Wikipedia article titled 'Matrix (mathematics)').
I wanted to multiply the column matrix 'a':
0.4 0.6
and the 4x4 matrix 'b':
0.3 0.45 0.7 0.55
I expected the answer to be:
0.39 0.61
However, I received an error: 'Error in a %*% b : non-conformable arguments' when I typed:
a %*% b
I obtained the answer I expected by switching the order of a and b:
b %*% a
Below is the R code:
a <- matrix(c(0.4,0.6), nrow=2)
a
b <- matrix(c(0.3, 0.7, 0.45, 0.55), nrow=2, byrow=F)
b
a %*% b
b %*% a
Mark W. Miller (talk) 22:22, 10 December 2011 (UTC)
R Code for Entering a Data Set from the Keyboard when both Numeric and Character Variables are Present
[edit]Most data sets seem to have both numeric and character variables. However, entering such data sets into Program R with the keyboard is difficult. Matrices in R require that all variables be the same type or mode (e.g., variables either are all numeric or all character). Below is R code showing an example of how to convert a variable from character to numeric after first entering the data into a matrix using the keyboard. This seems to be a common problem, but I have not seen the solution presented elsewhere.
First enter the data into a matrix. Then convert the matrix to a data frame while specifying that none of the variables are to be treated as factors. Instead all of the variables will be considered categorical. This is specified with the stringsAsFactors=F statement as shown below. Then use the aa[,1] <- as.numeric(aa[,1]) statement to convert, in this example, the first column from character to numeric. Now you have a data frame with one numeric and two character variables. One alternative is to use the read.table statement to read the data from an external file. However, the code below shows how to do it when data are entered directly from the keyboard.
aa <- matrix(c(
26,1,"L",
70,1,"L",
52,1,"L",
51,1,"L",
67,1,"L",
18,1,"M",
35,1,"M",
30,1,"M",
36,1,"M",
36,1,"H",
21,1,"H",
43,1,"H",
26,1,"H",
27,2,"L",
14,2,"L",
42,2,"M",
26,2,"M",
20,2,"H",
21,2,"H",
24,2,"H",
28,3,"H"),
nrow = 21, ncol=3, byrow=T)
aa <- as.data.frame(aa, stringsAsFactors=F)
colnames(aa) <- c("abundance", "variable1", "variable2")
str(aa)
aa[,1] <- as.numeric(aa[,1])
str(aa)
zz1 <- by(aa[, 1], aa[,2:3], max)
zz1
Mark W. Miller (talk) 23:17, 21 December 2011 (UTC)
C Code to Enter, Fill, Load, Put or Place Numeric Data into an Array or Matrix
[edit]If you want to put static numeric data into an array in C then perhaps try the following example of a 2-dimensional array with 3 rows and 4 columns:
double myarray[3][4] =
{{1, 2, 3, 4},
{5, 6, 7, 8},
{9, 10, 11, 12}};
Mark W. Miller (talk) 04:10, 6 March 2012 (UTC)
Running Compiled C Code and Exporting to a Text Output File
[edit]Here is an alternative way to export output from a C file to a text file. I wrote the standard 'hello, world' file using Jens' File Editor and saved the file as 'helloworld.c'. Then clicked on the 'CompileLink' tab in Jens' File Editor to create the executable 'helloworld.exe'. Here are the contents of 'helloworld.c' without indentation.
#include <stdio.h>
main()
{
printf("hello, world\n");
}
Clicking on 'helloworld.exe' will cause the program to flash the expression 'hello, world' on the screen too quickly to be read. At least I think so. The flash is so brief I cannot be certain what is printed on the screen.
I next opened a command window in Windows by typing 'cmd' in the box in the lower left corner of the screen under the Windows icon. Suppose the files 'helloworld.exe' and 'helloworld.c' are in a folder 'c:\c stuff'. In the command window change directory to 'c:\c stuff' and then create the line 'c:\c stuff\helloworld i=helloworld.c > helloworld.out'. That will create an output text file 'helloworld.out' containing the expression 'hello, world'.
I am using the free MinGW gnu C compiler. Jens' File Editor is also free. MinGW is in the root (c:\) directory. Jens' File Editor is not. On one computer I had to alter the path statements in the Jens' File Editor file 'jfe.ini' before Jens' File Editor would compile C code. The path statements in 'jfe.ini', I believe, must point to the location of the MinGW application.
Mark W. Miller (talk) 21:08, 19 March 2012 (UTC)
R Code for Coin Flips with Frequentist Logistic Regression
[edit]Below is R code for estimating the probability of getting heads if you flip a coin 100 times and get 20 heads. The code uses frequentist logistic regression. The point estimate of the probability is obtained three different way: 1) e(x) / (1+e(x)), 2) plogis and 3) predict. The third approach, predict, also returns the standard error of the point estimate of the probability by setting se.fit=T.
N <- 100
C1 <- 20
my.model1 <- glm(cbind(C1, N - C1) ~ 1, family = binomial)
my.model1
exp(as.numeric(my.model1[1])) / (1 + exp(as.numeric(my.model1[1])))
plogis(as.numeric(my.model1[1]))
my.parm1 <- predict(my.model1, type="response", se.fit=T)
as.numeric(my.parm1[1:2])
my.logLik1 <- logLik(my.model1)
my.logLik1
Mark W. Miller (talk) 09:17, 11 July 2012 (UTC)
R Code for Estimating Probability of Heads when Simulating Coin Flips
[edit]Suppose you want to estimate the probability of obtaining heads by flipping a coin, and you do not want to use either frequentist or Bayesian logistic regression. You could use the R code below to obtain both a point estimate of the probability of obtaining heads and an estimate of the standard error of the probability of obtaining heads.
Suppose the true probability of obtaining heads is 0.20. For an individual attempt at estimating that true probability you flip the coin 100 times and record the number of heads divided by the number of coin flips. That quantity is stored in a vector called 'est.p' in the R code below. I refer to 100 flips as an 'iteration' or a 'set'. Now you must repeat your set (your 100 flips) numerous times and for each set record the number of heads divided by 100 (the number of coin flips per set in this example). The R code below simulates one million sets with 100 coin flips in each set. In the R code below 'iters' is the number of sets, 'N' is the number of coin flips per set, and 'p.heads' is the true probability of obtaining heads on any given coin flip (which is used to simulate the data).
After repeating a large number of sets your point estimate of the probability of obtaining heads is the mean of the vector est.p (in other words, the mean of the one million values stored in the vector 'est.p'.). The estimate of the standard error of the probability of obtaining heads is the standard deviation of the one million values stored in the vector 'est.p'.
If you increase the number of sets from one million to ten million or 100 million your estimates obtained with the R code below might be virtually identical to the estimates you could obtain using the R code for logistic regression provided above. However, if you attempt 100 million sets you might have to wait a long time for the R code to run and your computer might crash before the 100 million sets have been simulated. Note that if you obtain 20 heads out of 100 flips in one set then the frequentist logistic regression code above will return p(heads) = 0.20 (SE = 0.04), while the R code below (with one million sets) returns p(heads) = 0.1999797 (SE = 0.03997182) which is identical if values are rounded to 4 decimal places.
set.seed(1234)
iters <- 1000000
N <- 100
p.heads <- 0.20
est.p <- rep(NA, length=iters)
for(i in 1:iters) {
C1 <- rbinom(N,1,p.heads)
est.p[i] <- sum(C1) / N
}
point.estimate.of.p.heads <- mean(est.p)
se.estimate.of.p.heads <- sd(est.p)
point.estimate.of.p.heads
se.estimate.of.p.heads
Mark W. Miller (talk) 11:02, 11 July 2012 (UTC)
R Code for Formatting Dates
[edit]Suppose you have a data set that contains a separate column for month, day and year and you want to convert those three columns into one date column using Program R. The following R code does that and also presents two ways to do the reverse. The second of those two ways is based on suggestions I found from DWin on Stackoverflow here:
http://stackoverflow.com/questions/6550678/split-date-into-different-columns-for-year-month-and-day
http://stackoverflow.com/questions/6987478/convert-a-month-abbreviation-to-a-numeric-month-in-r
#Create the data set
D <- data.frame(c('November', 'December'), c(16, 31), c(2012,2012), c('A','B'), c(10,100), stringsAsFactors=F)
colnames(D) <- c('month', 'day', 'year', 'group', 'result')
D
#Convert to R date format
D2 <- paste(D[,1], D[,2], D[,3], sep=' ')
D3 <- data.frame(as.Date(D2, "%B %d %Y"), D[,4:5], stringsAsFactors=F)
colnames(D3) <- c('date', 'group', 'result')
D3
#Revert to original format using an inefficient approach
DD3 <- D3
DD3$date <- as.character(DD3$date)
DD3$date <- gsub('-11-', ' November ', DD3$date)
DD3$date <- gsub('-12-', ' December ', DD3$date)
DD3
DD4 <- strsplit(DD3$date, "\\ ")
DD5 <- unlist(DD4)
DD6 <- do.call(rbind, DD4)
colnames(DD6) <- c('year', 'month', 'day')
DD7 <- DD6[, c("month", "day", "year")]
DD7 <- as.data.frame(DD7)
DD7$month <- as.character(DD7$month)
DD7$day <- as.numeric(as.character(DD7$day))
DD7$year <- as.numeric(as.character(DD7$year))
DD8 <- data.frame(DD7, DD3[,2:3])
DD8
identical(D, DD8)
#Revert to original format using a more efficient approach
DDD3 <- D3
dtstr <- as.character( DDD3[,1])
month <- as.numeric(as.character(sapply(strsplit(dtstr, "-") , "[", 2)))
day <- as.numeric(as.character(sapply(strsplit(dtstr, "-") , "[", 3)))
year <- as.numeric(as.character(sapply(strsplit(dtstr, "-") , "[", 1)))
DDD4 <- data.frame(month, day, year, D[,4:5])
DDD4$month <- month.name[DDD4$month]
DDD4
identical(D, DDD4)
Mark W. Miller (talk) 23:34, 19 November 2012 (UTC)
Replace One Cell in a Data Frame in R
[edit]Below is R code showing how to replace one value or one cell in a data frame. This code works without knowing the row and column numbers of the cell you wish to replace. Note that with this example, if 'df$aa == 'D' & df$bb == 4' does not identify a unique cell in the data frame then any cell for which 'df$aa == 'D' & df$bb == 4' will be replaced with '25'. In the first example below 'df$aa == 'D' & df$bb == 4' does identify a unique cell in the data frame 'df'.
aa <- c('A','D','C','D','E')
bb <- c(1,2,4,4,5)
cc <- c(6:10)
df <- data.frame(aa,bb,cc)
df
df$bb[df$aa == 'D' & df$bb == 4] <- 25
df
str(df)
If you know the row and column numbers of the cell you wish to replace you could use the following. This will replace the value in that specific cell regardless of the value in that cell, even if the value in that cell is a missing observation 'NA'.
aa <- LETTERS[1:5]
bb <- c(1,4,3,4,5)
cc <- c(6:10)
df <- data.frame(aa,bb,cc)
df
df[4,2] = 25
df
str(df)
Mark W. Miller (talk) 02:01, 25 November 2012 (UTC)
Simple SURVIV Example
[edit]proc model npar=2 addcell ;
cohort = 10;
4 : s(1) * s(2) ;
4 : s(1) * (1 - s(2)) ;
1 : (1 - s(1)) * s(2) ;
labels;
s(1) = survival ;
s(2) = detection ;
proc estimate novar name=one;
initial;
s(1) = 0.50 ;
s(2) = 0.50 ;
proc stop;
Mark W. Miller (talk) 02:17, 4 December 2012 (UTC)
Cautionary Note About Subset in Program R
[edit]state = c('Ohio')
my.data = read.table(text = "
fruit state year grade units
apples Ohio 1990 aa 500
apples Ohio 1990 bb 600
apples Ohio 1990 cc 700
orange Ohio 1990 aa 80
orange Ohio 1990 bb 70
orange Ohio 1990 cc 60
apples Iowa 1990 rr 500
apples Iowa 1990 ss 600
apples Iowa 1990 tt 700
orange Iowa 1995 rr 10
orange Iowa 1995 ss 20
orange Iowa 1995 tt 30
", sep = "", header = TRUE, stringsAsFactors = FALSE)
# this does not work
my.data2 <- subset(my.data, my.data$state == state)
my.data2
# this works
my.data3 <- subset(my.data, my.data$state == 'Ohio')
my.data3
# this works
my.data4 <- my.data[my.data$state == state,]
my.data4
Mark W. Miller (talk) 08:07, 20 December 2012 (UTC)
R Code for ifelse statement
[edit]Below are several examples of R code for the ifelse statement. Two examples show how to use the ifelse statement with a vector. R code is also presented showing how the ifelse statement can replace a for-loop. I also include code for using the microbenchmark package to time the ifelse statement and compare its speed to that of the for-loop. To make that comparison I inserted the ifelse statement into a function and did the same for the for-loop. The ifelse statement was approximately six times faster than the for-loop. An ifelse statement can be used in this manner because it is vectorized, unlike the if statement.
x <- seq(30,80,5)
x
# [1] 30 35 40 45 50 55 60 65 70 75 80
ifelse(x > 50, (x + 20), x)
# [1] 30 35 40 45 50 75 80 85 90 95 100
df1 = read.table(text = "
AA BB CC DD
1 10 2 0.01
5 15 4 0.02
10 20 6 0.03
21 25 8 0.04
21 30 10 0.05
30 35 12 0.06
30 40 14 0.07
40 45 16 0.08
40 50 18 0.09
50 55 20 0.10
50 60 22 0.11
75 65 24 0.12
75 70 26 0.13
75 75 28 0.14
80 80 30 0.15
80 85 32 0.16
90 90 34 0.17
90 95 36 0.18", sep = "", header = TRUE)
df1
df2 <- df1
df3 <- df1
df4 <- df1
df1$AA <- ifelse(df1$AA > 50, (df1$AA + 20), df1$AA)
df1
df2[,1] <- ifelse(df2[,1] > 50, (df2[,1] + 20), df2[,1])
df2
fun1 <- function(x) { AA <- ifelse(x[,1] > 50, (x[,1] + 20), x[,1])
return(data.frame(AA, x[,2:ncol(x)]))}
df33 <- fun1(df3)
fun2 <- function(y) {
for (i in seq(1:nrow(y))) {
if(y[i,1] > 50) y[i,1] = (y[i,1] + 20) else(y[i,1] = y[i,1])
}
AA <- y[,1]
return(data.frame(AA, y[,2:ncol(y)]))
}
df44 <- fun2(df4)
df1 == df2
df1 == df33
df1 == df44
library(microbenchmark)
microbenchmark(fun1(df3), fun2(df4), times=20)
# Unit: microseconds
# expr min lq median uq max
# 1 fun1(df3) 261.706 265.027 269.5545 271.819 505.302
# 2 fun2(df4) 1732.330 1761.760 1782.7395 1824.396 2490.885
Mark W. Miller (talk) 01:09, 8 January 2013 (UTC)
R Code to Replace Missing Observations with 0 in One Column of a Data Frame
[edit]Below is code to replace NA with 0 in R if you only want to replace the missing observations in one column of a data frame. A similar post exists on StackOverflow here: http://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-r/15014847#15014847
df = read.table(text = "
state year grade X1 X2 X3
Ohio 1990 aa 1 2 3
Ohio 1991 bb 4 5 NA
Ohio 1992 cc NA 8 9
Ohio 1993 dd 10 NA 12
Ohio 1994 ee 13 14 15
Ohio 1995 ff 16 17 NA
", na.strings = "NA", header = TRUE, stringsAsFactors = FALSE)
df$X3[is.na(df$X3)] <- 0
df
Mark W. Miller (talk) 02:01, 22 February 2013 (UTC)
R Code to Select Rows in a Data Frame Based on Multiple Criteria
[edit]The R code below selects rows in a data frame where var1.a equals var1.b and var2.a equals var2.b and none of elements in those four columns are missing observations (or NA). I select those rows using a which statement and also using subsetting within brackets.
(When I copy and paste this code into R I get an error reading the data. That is because one space of white space is introduced in the empty rows in the data set as a result of including an empty line between rows so the data set appears reasonably formatted in Wikipedia. In other words, if you copy the code and paste it into Notepad, remove the empty space in the empty rows in the data set, then paste the revised data set into R it should run.)
df = read.table(text = "
state county year var1.a var2.a var1.b var2.b
1 1 1990 10 25 20 25
1 2 1990 20 15 20 15
2 1 1990 30 NA 40 25
2 2 1990 40 35 10 35
3 1 2000 20 45 10 NA
3 2 2000 20 55 20 55
4 1 2000 NA 65 NA NA
4 2 2000 80 NA 30 NA ", sep = "", header = TRUE)
df[!is.na(df$var1.a) & !is.na(df$var1.b) & !is.na(df$var2.a) & !is.na(df$var2.b) & (df$var1.a == df$var1.b) & (df$var2.a == df$var2.b),]
which(((df$var1.a == df$var1.b) & (df$var2.a == df$var2.b)), arr.ind=TRUE)
df[which(((df$var1.a == df$var1.b) & (df$var2.a == df$var2.b)), arr.ind=TRUE),]
The answer is below where the 2 and 6 in the first column are row numbers returned by R:
state county year var1.a var2.a var1.b var2.b
2 1 2 1990 20 15 20 15
6 3 2 2000 20 55 20 55
Mark W. Miller (talk) 07:23, 18 March 2013 (UTC)
A Note About Running R Code Posted on this Site
[edit]The R code posted on this site should run. However, if you copy the R code from here and paste it into R you may get an error when reading data sets. That seems to be because white space is included in the empty rows I use to format data sets for Wikipedia. You can copy the R code from here, paste it into Notepad, remove the empty rows in the data sets, paste that revised code into R and it should run.
Mark W. Miller (talk) 07:42, 18 March 2013 (UTC)
Quit, Close, Terminate, End, Interrupt or Stop Program R
[edit]Usually if I want to quit Program R I click on the outer-most white X on a red background in the upper right corner of the R window. Alternatively I might type q() or quit() at the cursor and press Enter. However, sometimes I want to close Program R while a program, code or script is still running. In that case the above approaches might not work. In those instances, I have found that I can stop the code by using the mouse to press the 'Misc' tab at the top of the R window and then clicking either 'Stop all computations' (which seems to stop the current computation) or 'Stop current computation' (which seems to stop all computations). I might have to press 'Alt M' before the computer will allow me to click the 'Misc' tab.
See also here:
Mark W. Miller (talk) 05:52, 28 May 2013 (UTC)
Regular Expression to Replace First Character in String or Replace Last Character in String with Program R
[edit]Below is R code for regular expressions to replace the first N characters in a string or to replace the last N characters in a string. The line nn = 3 means that I am replacing three characters from a string. You can change nn to whatever number you wish. I am replacing each of nn characters with a '0'. You can replace characters with whatever symbol you wish. For example, to replace each character with a question mark symbol change rep(0, nn) to rep('?', nn).
Note that as described above, data sets posted here should be pasted into Notepad or some other editor to remove formatting before they can be read by Program R. The R code that follows the data set df.1 should run by simply copying the code from this webpage and pasting it directly into Program R.
What I have posted here is a generalization of an answer by dickoa to an earlier question of mine on Stackoverflow: http://stackoverflow.com/questions/16657366/replace-the-first-n-dots-of-a-string-revisited
df.1 <- read.table(text = '
my.string other.stuff
11111 10
..... 20
12345 30
abcde 40
zm234 50
.y900 60
qqqq. 70
', header = TRUE)
df.5 = df.4 = df.3 = df.2 = df.1
nn = 3
# Replace the first three characters of my.string:
df.2$my.string <- sub(sprintf("^.{%s}", nn), paste(rep(0, nn), collapse = ""), df.2$my.string)
# Replace the first three characters of my.string if they are a letter, number or dot:
df.3$my.string <- sub(sprintf("^[a-zA-Z0-9.]{%s}", nn), paste(rep(0, nn), collapse = ""), df.3$my.string)
# Replace the last character of my.string with three 0's:
df.4$my.string <- sub(sprintf(".${%s}", nn), paste(rep(0, nn), collapse = ""), df.4$my.string)
# Replace the last three characters of my.string with 0:
df.5$my.string <- sub(sprintf("...${%s}", nn), paste(rep(0, nn), collapse = ""), df.5$my.string)
Mark W. Miller (talk) 11:32, 6 June 2013 (UTC)
Easy Entry of Individual Covariates into Program MARKs Design Matrix
[edit]The following procedure can be used to enter individual covariates into the design matrix in Program MARK. This method may be helpful when the number of individual covariates to enter is large. Enter the individual covariates into Program Excel. Copy one column at a time from Excel. Position the mouse cursor in the cell of the design matrix of Program MARK where you want the first individual covariate in the column to go. Right click on the mouse. Select ‘Paste Clipboard’.
You can copy multiple columns at the same time from the Excel file if you wish. In that case position the mouse cursor in the upper-left-most cell of the design matrix where you wish that block of individual covariates to be positioned (for example, Cell 12 of Column B8:). Then right click on the mouse. Select ‘Paste Clipboard’.
Mark W. Miller (talk) 22:45, 14 June 2013 (UTC)
Easy Naming of Individual Covariates in Program MARK
[edit]Naming individual covariates in Program MARK can be time consuming and prone to typographic errors when creating a new file and the number of individual covariates is large. You can create a column of names in Excel then copy that column to the clipboard. Next position the mouse cursor in the cell for the name of the first individual covariate in Program MARK and click 'paste'. This will paste in the names of all the individual covariates.
This is different from, but similar to, entering names of covariates into the design matrix described above.
Mark W. Miller (talk) 23:09, 23 May 2014 (UTC)
Identify Position of First 1 by Row in R
[edit]my.data <- read.table(text = '
x1 x2 x3 x4 x5
0 1 1 2 0
0 0 1 NA 2
1 NA 1 2 0
NA NA NA NA NA
NA 0 0 1 1
0 0 1 0 0
0 0 1 NA NA
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')
as.numeric(t(apply((my.data==1), 1, which.max)))
[1] 2 3 1 NA 4 3 3
--Mark W. Miller (talk) 10:26, 28 July 2015 (UTC)
rdiscrete in Stata or Mata
[edit]clear
cd C:\Users\mark.w.miller\Documents\simple_Stata_files
global wave_of_cy = 3
global wave_numtrips = 34
set obs $wave_numtrips
local wave_obs = _N
clear
use "fakedata_July1.dta", replace
putmata mydata=(napples pcount1 pcount2 pcount3), replace
mata:
`wave_obs'
"wave_of_cy"
$wave_of_cy
p = mydata[.,$wave_of_cy+1]
pp = p :/ sum(p)
rdiscrete(`wave_obs', 1, pp)
end
Mark W. Miller (talk) 15:03, 11 July 2019 (UTC)
Create Data Set in Stata and Load Data Set into Mata
[edit]clear
cd C:\Users\mark.w.miller\Documents\simple_Stata_files
input napples pcount1 pcount2 pcount3 pcount4
0 . .33263497 .22213458 .483706
1 . .18085447 .0592791 .07308878
2 . .09918489 .17381628 .06245227
3 . .12741563 .12649602 .04291577
4 . .07782318 .05691737 .05342419
5 . .03847268 .03891907 .04361086
6 . .02907396 .03870039 .03463672
7 . .00939872 .02071542 .02554072
8 . .01879743 .01602958 .01832358
9 . .01923634 .0242933 .01647887
10 . .00983762 .01197392 .03653411
11 . .01967525 .03063997 .01564033
12 . 0 .02302153 .02067067
13 . .01879743 .0234988 .01317734
14 . .00939872 .02846567 .00248811
15 . 0 .00786653 .01114919
16 . .00939872 .01106286 .00873202
17 . 0 .00353461 .00346993
18 . 0 .01751844 .00989827
19 . 0 .00731278 .00044811
20 . 0 .0041034 .00288536
21 . 0 .00064227 .00133846
22 . 0 .01064502 .00661907
23 . 0 .00933989 .0065631
24 . 0 .00209015 .00099294
25 . 0 .00236378 .00276476
26 . 0 .00709134 0
27 . 0 0 0
28 . 0 .00497991 0
29 . 0 0 0
30 . 0 .00268613 .00188223
31 . 0 .00685134 0
32 . 0 0 .00056822
33 . 0 .00701054 0
end
list
save fakedata_July1
clear
use "fakedata_July1.dta", replace
putmata mydata=(napples pcount1 pcount2 pcount3 pcount4), replace
mata : mydata
Mark W. Miller (talk) 15:08, 11 July 2019 (UTC)