
Linux sed remove duplicates and get unique values



Removing duplicates and getting unique values is easy when the input follows a convenient format, for example when the values in the string or raw data are separated by spaces.

It gets harder when the values are separated by dashes instead of spaces.

So how do you remove duplicates, get the unique values, and still retain the format of the raw data?

Take this raw data, for example (just a sample string):
the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit

Suppose you try to remove duplicates and get unique values with this command:

sort duplicates.txt | uniq

(sort and uniq compare whole lines, so this works only when each value sits on its own line.)

duplicates.txt is assumed to contain the string shown above.

Sample output:
the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit

The output is exactly the same as the input. Why? Because the dashes join the words into one unbroken string on a single line, so sort and uniq see just one item and have nothing to compare it against.
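
To see this behaviour for yourself, here is a small sketch. The sample string and the variable names are made up for illustration, and tr is used only to put each word on its own line.

```shell
#!/bin/sh
# Hypothetical sample: one line of words joined by dashes, with "the" repeated.
line='the-quick-brown-fox-jumps-over-the-lazy-dog'

# sort and uniq compare whole lines, so a single line passes through unchanged.
unchanged=$(printf '%s\n' "$line" | sort | uniq)
printf '%s\n' "$unchanged"   # the-quick-brown-fox-jumps-over-the-lazy-dog

# Once every word is on its own line, uniq can spot the repeated "the".
deduped=$(printf '%s\n' "$line" | tr '-' '\n' | sort | uniq)
printf '%s\n' "$deduped"     # 8 lines, from "brown" to "the"
```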

Suppose the requirement is to remove duplicates, get the unique values, and retain the dashes in the final output. How do you do it?

It can be done in 3 steps:
  1. Replace the dashes in the raw data with spaces
  2. Remove the duplicates and get the unique values
  3. Put the dashes back into the final output
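
The three steps above can also be chained into a single pipeline with no intermediate files. This is just a minimal sketch: it swaps in tr and paste instead of sed, splits on newlines rather than spaces, and the function name dedupe_dashed is made up for illustration.

```shell
#!/bin/sh
# dedupe_dashed (hypothetical helper): reads dash-separated words on stdin
# and prints the unique words, sorted, joined by dashes again.
dedupe_dashed() {
    tr '-' '\n' |        # step 1: turn every dash into a line break
    sort -u |            # step 2: sort and keep one copy of each word
    paste -s -d '-' -    # step 3: join the lines back together with dashes
}

printf '%s\n' 'the-lazy-dog-jumps-over-the-lazy-cat' | dedupe_dashed
# prints: cat-dog-jumps-lazy-over-the
```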


Here’s the shell script:

#!/bin/sh

xfile="file_with_dashes_and_duplicates.txt"

# Step 1: replace every dash with a space
sed 's/-/ /g' "$xfile" > "$HOME/file_with_no_dashes.txt"

# Step 2: put one word per line, sort and drop duplicates, then rejoin on one line
file="$HOME/file_no_duplicate.txt"
xargs -n1 < "$HOME/file_with_no_dashes.txt" | sort -u | xargs > "$file"

# Step 3: replace every space with a dash
sed 's/ /-/g' "$file" > "$HOME/final_output_no_duplicates_unique_content.txt"

=======
sed 's/-/ /g' == find each dash "-" and substitute a space (the single space between the second and third slash). g (global) == replace every match on the line, not just the first one

sed 's/ /-/g' == find each space and substitute a dash; g (global) applies it to every match on the line

> file_no_duplicate.txt == I/O redirection; the command's output is written to the specified file
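
To make the effect of the g flag concrete, here is a short sketch that runs the same substitution with and without it. The sample string and variable names are only for illustration.

```shell
#!/bin/sh
sample='the-quick-brown-fox'

# Without g, sed replaces only the first dash on each line.
first_only=$(printf '%s\n' "$sample" | sed 's/-/ /')
printf '%s\n' "$first_only"   # the quick-brown-fox

# With g, sed replaces every dash on the line.
all_dashes=$(printf '%s\n' "$sample" | sed 's/-/ /g')
printf '%s\n' "$all_dashes"   # the quick brown fox
```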

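Output of the above script at each step, with the sample string as the input:

Raw file or input data:
the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit

File or data with no dashes:
the quick brown fox jumps over the lazy dog jumps over the lazy cat jumps over the rabbit

File or data with no duplicates:
brown cat dog fox jumps lazy over quick rabbit the

The final output:
brown-cat-dog-fox-jumps-lazy-over-quick-rabbit-the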
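
One thing to keep in mind: sort -u returns the words in alphabetical order, not in the order they first appeared. If the original order matters, a common alternative is awk's seen-array idiom. This is a sketch of a different technique, not the script above, and the function name dedupe_keep_order is made up for illustration.

```shell
#!/bin/sh
# dedupe_keep_order (hypothetical helper): reads dash-separated words on stdin
# and prints each word once, in order of first appearance, joined by dashes.
dedupe_keep_order() {
    tr '-' '\n' |          # one word per line
    awk '!seen[$0]++' |    # print a line only the first time it is seen
    paste -s -d '-' -      # join the lines back together with dashes
}

printf '%s\n' 'the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit' | dedupe_keep_order
# prints: the-quick-brown-fox-jumps-over-lazy-dog-cat-rabbit
```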

Cheers. Enjoy Linux!!!

